Introduction

Flight delays are a significant societal problem as they impair airlines, transport companies, air traffic controllers, facility managers, and passengers. Studying flight data is an essential activity for every player involved in the air transportation system. Given the uncertainty of delays, passengers usually plan to travel many hours earlier for their appointments. They might have to increase their travel costs to arrive on time. Airlines suffer penalties, fines, and additional operation costs, such as airport crew and aircraft retention. Regarding sustainability, delays may also increase environmental issues due to increased fuel consumption and gas emissions (Rebollo and Balakrishnan 2014) (Sternberg et al., 2017); (Carvalho et al. 2021).

This project aims to generate a prediction model of departure delay time. A reliable prediction will benefit airports, airlines, passengers, and global environmental sustainability.

Data

Our analysis is based on the public flights dataset taken from the “nycflights13” R package. This package contains information about all flights that departed from NYC in 2013. We used additional datasets from the same package, such as ‘planes’, ‘weather’, and ‘airport’. These datasets provide extensive information about the flights. ‘planes’ dataset contains construction information about each plane and was merged with the ‘flights’ dataset by the common variable column “tailnum”. ‘weather’ dataset, which includes hourly meteorological data for each airport, was merged with ‘flights’ according to the “origin” and “time_hour” variable columns. Similarly, the ‘airports’ dataset, which represents all involved airports and their locations, was merged with ‘flights’ by “faa”(=“dest”) variable column. The merged dataset contains 47 variables and 276688 observations.

Load packages and dataset:

 options(repos = list(CRAN="http://cran.rstudio.com/"))
list.of.packages <-
  c(
    "dataPreparation",
    "nycflights13",
    "dplyr",
    "tidyverse",
    "RColorBrewer",
    "ggplot2",
    "lubridate",
    "ROSE",
    "ranger",
    "gridExtra",
    "ranger",
    "ROSE",
    "caret",
    "rpart",
    "rpart.plot",
    "rattle",
    "Metrics",
    "mlr",
    "plotly",
    "caTools",
    "psych"
  )

new.packages <-
  list.of.packages[!(list.of.packages %in% installed.packages()[, "Package"])]
if (length(new.packages)) {
  install.packages(new.packages)
}

library(dataPreparation)
library(nycflights13)
library(dplyr)
library(tidyverse)
library(RColorBrewer)
library(ggplot2)
library(lubridate)
library(ROSE)
library(ranger)
library(gridExtra)
library(ranger)
library(ROSE)
library(caret)
library(rpart)
library(rattle)
library(rpart.plot)
library(Metrics)
library(mlr)
library(plotly)
library(caTools)
library(psych)
flights <- nycflights13::flights
airports <- nycflights13::airports
planes <- nycflights13::planes
weather <- nycflights13::weather

Merge flights, weather, planes, and airports datasets:

flights_planes <- merge(flights, planes, by = "tailnum")
flights_planes_weather <-
  merge(flights_planes, weather, by = c("origin", "time_hour"))
flights_planes_weather <-
  flights_planes %>% merge(weather, by = c("origin", "time_hour"))
flights_full <-
  merge(flights_planes_weather,
        airports,
        by.x = c("dest"),
        by.y = c("faa"))

dim(flights_full)
## [1] 276688     47
head(flights_full)
##   dest origin           time_hour tailnum year.x month.x day.x dep_time
## 1  ABQ    JFK 2013-07-23 20:00:00  N589JB   2013       7    23     2206
## 2  ABQ    JFK 2013-05-09 20:00:00  N766JB   2013       5     9     2005
## 3  ABQ    JFK 2013-12-19 20:00:00  N519JB   2013      12    19     2102
## 4  ABQ    JFK 2013-06-24 20:00:00  N639JB   2013       6    24     2024
## 5  ABQ    JFK 2013-09-24 20:00:00  N615JB   2013       9    24     1955
## 6  ABQ    JFK 2013-12-24 20:00:00  N630JB   2013      12    24     2001
##   sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight
## 1           2007       119      116           2259       137      B6   1505
## 2           2001         4     2249           2308       -19      B6   1505
## 3           2001        61     2358           2304        54      B6     65
## 4           2007        17     2307           2259         8      B6   1505
## 5           2001        -6     2226           2248       -22      B6     65
## 6           2001         0     2304           2304         0      B6     65
##   air_time distance hour.x minute year.y                    type
## 1      253     1826     20      7   2004 Fixed wing multi engine
## 2      259     1826     20      1   2008 Fixed wing multi engine
## 3      265     1826     20      1   2001 Fixed wing multi engine
## 4      232     1826     20      7   2006 Fixed wing multi engine
## 5      243     1826     20      1   2005 Fixed wing multi engine
## 6      275     1826     20      1   2005 Fixed wing multi engine
##       manufacturer    model engines seats speed    engine year month.y day.y
## 1           AIRBUS A320-232       2   200    NA Turbo-fan 2013       7    23
## 2           AIRBUS A320-232       2   200    NA Turbo-fan 2013       5     9
## 3 AIRBUS INDUSTRIE A320-232       2   200    NA Turbo-fan 2013      12    19
## 4           AIRBUS A320-232       2   200    NA Turbo-fan 2013       6    24
## 5           AIRBUS A320-232       2   200    NA Turbo-fan 2013       9    24
## 6           AIRBUS A320-232       2   200    NA Turbo-fan 2013      12    24
##   hour.y  temp  dewp humid wind_dir wind_speed wind_gust precip pressure visib
## 1     20 82.04 71.06 69.43      230   11.50780        NA      0   1002.3    10
## 2     20 57.92 53.96 86.63      210    6.90468        NA      0   1013.9    10
## 3     20 42.98 28.04 55.27      250    6.90468        NA      0   1020.5    10
## 4     20 80.96 64.94 58.25      250    9.20624        NA      0   1015.6    10
## 5     20 68.00 41.00 37.32      280   10.35702        NA      0   1012.6    10
## 6     20 32.00 19.94 60.66      310   21.86482   28.7695      0   1025.0    10
##                                name      lat       lon  alt tz dst
## 1 Albuquerque International Sunport 35.04022 -106.6092 5355 -7   A
## 2 Albuquerque International Sunport 35.04022 -106.6092 5355 -7   A
## 3 Albuquerque International Sunport 35.04022 -106.6092 5355 -7   A
## 4 Albuquerque International Sunport 35.04022 -106.6092 5355 -7   A
## 5 Albuquerque International Sunport 35.04022 -106.6092 5355 -7   A
## 6 Albuquerque International Sunport 35.04022 -106.6092 5355 -7   A
##            tzone
## 1 America/Denver
## 2 America/Denver
## 3 America/Denver
## 4 America/Denver
## 5 America/Denver
## 6 America/Denver

Exploratory data analysis (EDA) and Pre-Proccessing Analysis

In order to successfully predict a delayed flight we would like to use the most affecting features/variables in our model and avoid features that may increase noise. Achieving this goal requires pre-processing steps in which we reduce the number of variables and the number of levels/categories in variables.

First, we plotted numeric variables vs. departure delay in order to visually spot if there is a linear relationship between them. On top of the scatter plot there is a linear model fit line:

plot_var_vs_dep_delay <-
  function(var_name, df = numeric_flights_full){
    df <- df[sample(nrow(df), 10000), ]
     p<-ggplot(data = df, mapping = aes(x = get(var_name), y = dep_delay)) +
      geom_point(alpha = 0.2) +
      geom_smooth(method = "lm",level=0.2) +
      labs(
        x = sprintf("%s", var_name),
        y = "Departure delay (min)",
        title = sprintf("%s vs departure delay", var_name))
    return(p)
}

numeric_vars <- c(
  'sched_dep_time',
  'dep_delay',
  'arr_delay',
  'distance',
  'air_time',
  'temp',
  'wind_dir',
  'wind_speed',
  'wind_gust',
  "precip",
  "pressure",
  "visib",
  "humid",
  "dewp"
)

numeric_flights_full <- flights_full %>% select(all_of(numeric_vars))

plot_list <- lapply(numeric_vars, plot_var_vs_dep_delay)

grid.arrange(grobs = plot_list, ncol = 3)

We would like to examine a correlation between numeric variables. In order to calculate Pearson correlation, certain assumptions must be met, one of them is that there must be a linear relationship between the variables. Observing the scatter plots above, we can see there is no linear relationship between those numeric variables. Thus, we chose to apply Spearman correlation test, which does not require linear relationship assumptions.

Spearman correlation test between numerical variables in the datase:

cor_test_mat <- corr.test(numeric_flights_full, method = "spearman")  
#r
cor_test_mat$r
##                sched_dep_time    dep_delay   arr_delay     distance
## sched_dep_time    1.000000000  0.242468155  0.15972763 -0.020968959
## dep_delay         0.242468155  1.000000000  0.63416915  0.076474466
## arr_delay         0.159727631  0.634169152  1.00000000 -0.071903136
## distance         -0.020968959  0.076474466 -0.07190314  1.000000000
## air_time         -0.021874600  0.079279091 -0.02003161  0.984448446
## temp              0.078878376  0.046728973 -0.02567579  0.006721989
## wind_dir          0.009809660 -0.009695863 -0.01368203  0.001601394
## wind_speed        0.123476957  0.065576107  0.07031103  0.019372379
## wind_gust         0.050621861  0.036945207  0.06842991  0.030620264
## precip            0.009062127  0.116261731  0.15317115  0.003132605
## pressure         -0.074564174 -0.113104964 -0.12835539  0.008417160
## visib             0.082449488 -0.082294763 -0.12681222 -0.009263453
## humid            -0.159960170  0.086862748  0.11285397  0.023404245
## dewp             -0.008916716  0.080997447  0.02851065  0.016728963
##                    air_time         temp     wind_dir  wind_speed   wind_gust
## sched_dep_time -0.021874600  0.078878376  0.009809660  0.12347696  0.05062186
## dep_delay       0.079279091  0.046728973 -0.009695863  0.06557611  0.03694521
## arr_delay      -0.020031610 -0.025675795 -0.013682031  0.07031103  0.06842991
## distance        0.984448446  0.006721989  0.001601394  0.01937238  0.03062026
## air_time        1.000000000 -0.049792971  0.011814677  0.02951930  0.04791494
## temp           -0.049792971  1.000000000 -0.144816796 -0.11791494 -0.35254088
## wind_dir        0.011814677 -0.144816796  1.000000000  0.34195792  0.13169708
## wind_speed      0.029519300 -0.117914944  0.341957922  1.00000000  0.87662980
## wind_gust       0.047914943 -0.352540879  0.131697082  0.87662980  1.00000000
## precip          0.019925058 -0.065799680 -0.083363231  0.04379486  0.08782057
## pressure       -0.001528471 -0.239281149 -0.186149803 -0.20058037 -0.21646913
## visib          -0.024508213  0.054669345  0.223136647  0.12964654 -0.10737224
## humid           0.028202000  0.045535829 -0.351682726 -0.21874997 -0.01924615
## dewp           -0.030345953  0.881335226 -0.279280924 -0.19406358 -0.30536777
##                      precip     pressure        visib       humid         dewp
## sched_dep_time  0.009062127 -0.074564174  0.082449488 -0.15996017 -0.008916716
## dep_delay       0.116261731 -0.113104964 -0.082294763  0.08686275  0.080997447
## arr_delay       0.153171154 -0.128355394 -0.126812225  0.11285397  0.028510646
## distance        0.003132605  0.008417160 -0.009263453  0.02340424  0.016728963
## air_time        0.019925058 -0.001528471 -0.024508213  0.02820200 -0.030345953
## temp           -0.065799680 -0.239281149  0.054669345  0.04553583  0.881335226
## wind_dir       -0.083363231 -0.186149803  0.223136647 -0.35168273 -0.279280924
## wind_speed      0.043794856 -0.200580367  0.129646536 -0.21874997 -0.194063580
## wind_gust       0.087820566 -0.216469127 -0.107372244 -0.01924615 -0.305367772
## precip          1.000000000 -0.081572193 -0.480288221  0.37760322  0.102317758
## pressure       -0.081572193  1.000000000  0.128824903 -0.13973204 -0.267869766
## visib          -0.480288221  0.128824903  1.000000000 -0.56241399 -0.197036048
## humid           0.377603217 -0.139732043 -0.562413993  1.00000000  0.494198023
## dewp            0.102317758 -0.267869766 -0.197036048  0.49419802  1.000000000
#p value
cor_test_mat$p
##                sched_dep_time     dep_delay     arr_delay      distance
## sched_dep_time   0.000000e+00  0.000000e+00  0.000000e+00  5.412243e-27
## dep_delay        0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00
## arr_delay        0.000000e+00  0.000000e+00  0.000000e+00 1.878316e-306
## distance         2.706121e-28  0.000000e+00 4.268899e-308  0.000000e+00
## air_time         4.124446e-30  0.000000e+00  1.621108e-25  0.000000e+00
## temp             0.000000e+00 1.439739e-131  7.639191e-41  4.065868e-04
## wind_dir         3.512590e-07  5.840768e-07  1.942815e-12  4.056914e-01
## wind_speed       0.000000e+00 2.529843e-257 1.365972e-294  2.203719e-24
## wind_gust        7.008941e-39  4.669118e-21  5.446890e-68  3.120584e-15
## precip           1.871206e-06  0.000000e+00  0.000000e+00  9.939732e-02
## pressure        1.573520e-300  0.000000e+00  0.000000e+00  2.953352e-05
## visib            0.000000e+00  0.000000e+00  0.000000e+00  1.100475e-06
## humid            0.000000e+00  0.000000e+00  0.000000e+00  7.787442e-35
## dewp             2.728860e-06  0.000000e+00  5.923647e-50  1.368616e-18
##                     air_time          temp      wind_dir    wind_speed
## sched_dep_time  8.661337e-29  0.000000e+00  3.863849e-06  0.000000e+00
## dep_delay       0.000000e+00 5.039086e-130  5.840768e-06 1.011937e-255
## arr_delay       3.080106e-24  1.986190e-39  2.525659e-11 5.737081e-293
## distance        0.000000e+00  1.626347e-03  8.113828e-01  3.746322e-23
## air_time        0.000000e+00 4.464886e-147  1.461914e-08  5.964947e-52
## temp           1.240246e-148  0.000000e+00  0.000000e+00  0.000000e+00
## wind_dir        1.218262e-09  0.000000e+00  0.000000e+00  0.000000e+00
## wind_speed      2.056878e-53  0.000000e+00  0.000000e+00  0.000000e+00
## wind_gust       3.391162e-34  0.000000e+00 1.678106e-253  0.000000e+00
## precip          2.905854e-25 4.814770e-263  0.000000e+00 1.663411e-117
## pressure        4.515856e-01  0.000000e+00  0.000000e+00  0.000000e+00
## visib           2.283766e-37 4.125182e-182  0.000000e+00  0.000000e+00
## humid           6.461952e-49 6.597392e-127  0.000000e+00  0.000000e+00
## dewp            2.344350e-56  0.000000e+00  0.000000e+00  0.000000e+00
##                    wind_gust        precip      pressure         visib
## sched_dep_time  1.752235e-37  1.309844e-05 6.766138e-299  0.000000e+00
## dep_delay       7.470589e-20  0.000000e+00  0.000000e+00  0.000000e+00
## arr_delay       1.688536e-66  0.000000e+00  0.000000e+00  0.000000e+00
## distance        4.368817e-14  2.981919e-01  1.476676e-04  8.803803e-06
## air_time        7.460556e-33  5.230537e-24  8.113828e-01  5.481038e-36
## temp            0.000000e+00 1.974056e-261  0.000000e+00 1.567569e-180
## wind_dir       6.544614e-252  0.000000e+00  0.000000e+00  0.000000e+00
## wind_speed      0.000000e+00 5.489257e-116  0.000000e+00  0.000000e+00
## wind_gust       0.000000e+00 3.903508e-112  0.000000e+00 1.238188e-167
## precip         1.219846e-113  0.000000e+00  0.000000e+00  0.000000e+00
## pressure        0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00
## visib          3.346453e-169  0.000000e+00  0.000000e+00  0.000000e+00
## humid           7.196687e-07  0.000000e+00  0.000000e+00  0.000000e+00
## dewp            0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00
##                        humid         dewp
## sched_dep_time  0.000000e+00 1.637316e-05
## dep_delay       0.000000e+00 0.000000e+00
## arr_delay       0.000000e+00 1.658621e-48
## distance        1.791112e-33 2.052924e-17
## air_time        1.744727e-47 7.033051e-55
## temp           2.243113e-125 0.000000e+00
## wind_dir        0.000000e+00 0.000000e+00
## wind_speed      0.000000e+00 0.000000e+00
## wind_gust       6.477018e-06 0.000000e+00
## precip          0.000000e+00 0.000000e+00
## pressure        0.000000e+00 0.000000e+00
## visib           0.000000e+00 0.000000e+00
## humid           0.000000e+00 0.000000e+00
## dewp            0.000000e+00 0.000000e+00

Plot of correlation heatmap:

#plot the correlation matrix
cor.plot(cor_test_mat$r, main = "Correlation Matrix (spearman)", stars = TRUE, xlas = 2)

The variables ‘dewp’ and ‘temp’ are highly correlated (0.9), so are ‘wind_speed’ and ‘wind_gust’ (0.87), and ‘air_time’ and ‘distance’ (0.98).The correlation is significant (pval<0.05) We decided to remove ‘dewp’, ‘air_time’, and ‘wind_gust’ to avoid double variables in the model. All three variables are logically an outcome of the variable they are highly correlated to.

We edited some of the columns in our dataset in order to improve their potential contribution to the prediction:

Convert hours column to minutes

flights_full <-
  flights_full %>% mutate(
    sched_dep_time = hour.x * 60 + minute,
    sched_arr_time = floor(sched_arr_time / 100) *
      60 + sched_arr_time %% 100
  )

Convert days to week days

flights_full <- flights_full %>%
  mutate(w_day = wday(time_hour, label = TRUE))

Convert days to weeks (52 weeks per year)

flights_full <-
  flights_full %>% mutate(week_num = (year(time_hour) - year(min(time_hour))) * 52 +
                            week(time_hour) - week(min(time_hour)))

Convert wind direction from degrees to 16 compass directions

directions <- read.csv('../output/wind_directions.csv')

flights_full <- flights_full %>%
  mutate(wind_dir = cut(
    as.numeric(wind_dir),
    breaks = c(0, directions$degree_max, 360),
    labels = c(directions$cardinal, 'N')
  ))
#note: when the wind direction is 0 degrees the wind direction is NA.
#Also, the wind speed is 0.
#This is correct because if the wind is not moving then it does not have a direction.

Remove NA from dep_delay

flights_full <- flights_full %>% drop_na(dep_delay)

Remove sd outliers

flights_full <-
  remove_sd_outlier(flights_full,
                    cols = "dep_delay",
                    n_sigmas = 7,
                    verbose = TRUE)
## [1] "remove_sd_outlier: I start to filter categorical rare events"
## [1] "remove_sd_outlier: dropped 548 row(s) that are rare event on dep_delay."
## [1] "remove_sd_outlier: 548 have been dropped. It took 0.05 seconds. "

Convert the classes of part of the columns

flights_full <- transform(
  flights_full,
  origin = as.factor(origin),
  carrier = as.factor(carrier),
  tzone = as.factor(tzone),
  type = as.factor(type),
  model = as.factor(model),
  engine = as.factor(engine),
  hour.y = as.numeric(hour.y),
  hour.x = as.numeric(hour.x),
  manufacturer = as.factor(manufacturer),
  month.x = ordered(as.factor(month.x)),
  month.y = ordered(as.factor(month.y)),
  dest= as.factor(dest)
)

Remove identical and constant variables

flights_full <- fast_filter_variables(
  flights_full,
  level = 2,
  keep_cols = NULL,
  verbose = TRUE
)
## [1] "fast_filter_variables: I check for constant columns."
## [1] "fast_filter_variables: I delete 2 constant column(s) in data_set."
## [1] "fast_filter_variables: I check for columns in double."
## [1] "fast_filter_variables: I delete 3 column(s) that are in double in data_set."

Suspected columns to have mostly NAs

length(which(is.na(flights_full$speed)))
## [1] 271073
length(which(is.na(flights_full$wind_gust)))
## [1] 207178

Remove irrelevant columns

flights_full <-
  select(
    flights_full,-c(
      time_hour,
      arr_delay,
      flight,
      tailnum,
      arr_time,
      dep_time,
      name,
      lat,
      lon,
      alt,
      tz,
      dst,
      speed,
      wind_gust,
      dewp,
      hour.x,
      minute,
      day.x,
      air_time
    )
  )

Let’s take a look at the data:

summary(flights_full)
##       dest        origin          month.x       sched_dep_time  
##  LAX    : 15368   EWR:110190   10     : 23965   Min.   : 300.0  
##  ATL    : 14320   JFK: 88371   8      : 23847   1st Qu.: 545.0  
##  BOS    : 13534   LGA: 73419   5      : 23608   Median : 835.0  
##  MCO    : 13061                7      : 23563   Mean   : 814.4  
##  SFO    : 12639                3      : 23106   3rd Qu.:1049.0  
##  CLT    : 12064                4      : 23058   Max.   :1425.0  
##  (Other):190994                (Other):130833                   
##    dep_delay     sched_arr_time      carrier         distance        year.y    
##  Min.   :-43.0   Min.   :   1.0   UA     :55437   Min.   :  80   Min.   :1956  
##  1st Qu.: -5.0   1st Qu.: 681.0   EV     :51015   1st Qu.: 529   1st Qu.:1999  
##  Median : -1.0   Median : 955.0   B6     :49385   Median : 888   Median :2002  
##  Mean   : 12.5   Mean   : 934.9   DL     :46007   Mean   :1062   Mean   :2001  
##  3rd Qu.: 11.0   3rd Qu.:1188.0   US     :19758   3rd Qu.:1400   3rd Qu.:2006  
##  Max.   :298.0   Max.   :1439.0   9E     :17310   Max.   :4983   Max.   :2013  
##                                   (Other):33068                  NA's   :5054  
##                        type                               manufacturer  
##  Fixed wing multi engine :270051   BOEING                       :79441  
##  Fixed wing single engine:  1552   EMBRAER                      :63272  
##  Rotorcraft              :   377   AIRBUS                       :43742  
##                                    AIRBUS INDUSTRIE             :40039  
##                                    BOMBARDIER INC               :27416  
##                                    MCDONNELL DOUGLAS AIRCRAFT CO: 8766  
##                                    (Other)                      : 9304  
##              model           engines          seats      
##  A320-232       : 41865   Min.   :1.000   Min.   :  2.0  
##  EMB-145LR      : 26431   1st Qu.:2.000   1st Qu.: 55.0  
##  ERJ 190-100 IGW: 23280   Median :2.000   Median :149.0  
##  737-824        : 13320   Mean   :1.994   Mean   :136.4  
##  EMB-145XR      : 13299   3rd Qu.:2.000   3rd Qu.:189.0  
##  CL-600-2D24    : 11666   Max.   :4.000   Max.   :450.0  
##  (Other)        :142119                                  
##            engine            temp            humid           wind_dir     
##  4 Cycle      :    37   Min.   : 10.94   Min.   : 12.74   W      : 28945  
##  Reciprocating:  1648   1st Qu.: 42.08   1st Qu.: 43.75   S      : 28309  
##  Turbo-fan    :230495   Median : 57.92   Median : 57.33   NW     : 26301  
##  Turbo-jet    : 39380   Mean   : 57.11   Mean   : 59.29   SW     : 20188  
##  Turbo-prop   :    43   3rd Qu.: 71.96   3rd Qu.: 74.80   WNW    : 18852  
##  Turbo-shaft  :   377   Max.   :100.04   Max.   :100.00   (Other):132408  
##                         NA's   :15       NA's   :15       NA's   : 16977  
##    wind_speed         precip            pressure          visib       
##  Min.   : 0.000   Min.   :0.000000   Min.   : 983.8   Min.   : 0.000  
##  1st Qu.: 6.905   1st Qu.:0.000000   1st Qu.:1012.9   1st Qu.:10.000  
##  Median :10.357   Median :0.000000   Median :1017.6   Median :10.000  
##  Mean   :10.993   Mean   :0.004304   Mean   :1017.9   Mean   : 9.287  
##  3rd Qu.:13.809   3rd Qu.:0.000000   3rd Qu.:1022.9   3rd Qu.:10.000  
##  Max.   :42.579   Max.   :1.210000   Max.   :1042.1   Max.   :10.000  
##  NA's   :69                          NA's   :29023                    
##                  tzone        w_day          week_num    
##  America/Anchorage  :     6   Sun:37893   Min.   : 0.00  
##  America/Chicago    : 55657   Mon:40906   1st Qu.:13.00  
##  America/Denver     :  9837   Tue:40352   Median :26.00  
##  America/Los_Angeles: 43201   Wed:40641   Mean   :25.74  
##  America/New_York   :158016   Thu:40400   3rd Qu.:39.00  
##  America/Phoenix    :  4562   Fri:40416   Max.   :51.00  
##  Pacific/Honolulu   :   701   Sat:31372

Dividing Departure Delay Variable into 2 Categories - ‘delay’ vs. ‘no delay’

Since we would like to predict if a particular flight will be delayed, we logically selected a threshold indicating a significant delay on time - 20 minutes.

Flights that depart earlier than 10 minutes before their scheduled time (departure delay <= -10) were removed from the following analysis since they may confuse our model and add noise, as we aim to predict positive delay. Those flights with departure delay <= (-10) should be included in the analysis in case of prediction of negative delay.

Here we assume that a flight is considered as delayed only if its departure time was delayed by 20 minutes or more. A flight is considered as departed on time, i.e, no delay, if its departure time is in the range of 10 minutes earlier and until 20 minutes late (not including 20 minutes late).

Divide the flights into 2 groups according to their dep_delay - flights above 20 min delay, and flights above -10 & until 20 min delay

flights_full_new_dep_delay <-
  flights_full[which(flights_full$dep_delay > -10),]
flights_full_arranged <-
  flights_full_new_dep_delay %>% arrange(dep_delay)

# plot histogram of original dep_delay before changing it into 2 categories with the threshold
ggplot(flights_full_arranged, aes(x = dep_delay)) +
  geom_histogram(color = "black", fill = "white", bins = 40) +
  geom_vline(aes(xintercept = 20, color = "delay > 20 min"),
             linetype = "dashed",
             size = 1.3) +
  
  scale_color_manual(name = "Treshold delay time", values = c("delay > 20 min" = "red")) +
  
  labs(
    x = "Departure delay time [min]",
    y = "counts of flights",
    title = paste('Histogram of departure delay time')
  ) +
  theme(plot.title = element_text(hjust = 0.5, size = 19, face = "bold"))

Percentage of dep_delay=0

(length(which(flights_full_arranged$dep_delay < (20)))) / nrow(flights_full_arranged)
## [1] 0.7971997

Percentage of dep_delay=1

(length(which(flights_full_arranged$dep_delay >= (20)))) / nrow(flights_full_arranged)
## [1] 0.2028003

Change dep_delay column into categories (0 / 1)

flights_full_arranged <-
  
  flights_full_arranged %>% mutate(dep_delay = case_when(dep_delay < 20 ~ 0,
                                                         dep_delay >= 20 ~ 1))

Convert dep_delay to factor column

flights_full_arranged$dep_delay <-
  as.factor(flights_full_arranged$dep_delay)

Plot flights counts per 2 dep_delay categories - add legend

ggplot(flights_full_arranged, aes(dep_delay, fill = dep_delay)) + geom_bar(stat="count") + 
  scale_fill_manual(labels=c("0 = no delay", "1 = delay"), values = c('#CC6666', '#FFCCCC')) +
  labs(title = "Flights counts per departure delay category", x = "Departure delay categories") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

str(flights_full_arranged)
## Classes 'data.table' and 'data.frame':   262613 obs. of  25 variables:
##  $ dest          : Factor w/ 100 levels "ABQ","ACK","ALB",..: 1 1 1 1 1 2 2 2 2 2 ...
##  $ origin        : Factor w/ 3 levels "EWR","JFK","LGA": 2 2 2 2 2 2 2 2 2 2 ...
##  $ month.x       : Ord.factor w/ 12 levels "1"<"2"<"3"<"4"<..: 5 9 10 10 11 7 9 9 8 7 ...
##  $ sched_dep_time: num  1201 1201 1199 1199 1200 ...
##  $ dep_delay     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sched_arr_time: num  1388 1368 1366 1366 1383 ...
##  $ carrier       : Factor w/ 16 levels "9E","AA","AS",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ distance      : num  1826 1826 1826 1826 1826 ...
##  $ year.y        : int  2004 2004 2006 2004 2011 2005 2011 2011 2005 2006 ...
##  $ type          : Factor w/ 3 levels "Fixed wing multi engine",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ manufacturer  : Factor w/ 35 levels "AGUSTA SPA","AIRBUS",..: 2 2 2 2 2 18 18 18 18 18 ...
##  $ model         : Factor w/ 127 levels "150","172E","172M",..: 89 89 89 89 89 109 109 109 109 109 ...
##  $ engines       : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ seats         : int  200 200 200 200 200 20 20 20 20 20 ...
##  $ engine        : Factor w/ 6 levels "4 Cycle","Reciprocating",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp          : num  57.9 63 60.1 53.1 52 ...
##  $ humid         : num  69.5 70.1 74.6 48.5 36.3 ...
##  $ wind_dir      : Factor w/ 16 levels "N","NNE","NE",..: 9 4 9 11 13 11 15 14 2 10 ...
##  $ wind_speed    : num  6.9 3.45 11.51 20.71 18.41 ...
##  $ precip        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ pressure      : num  1021 1025 1009 1014 1017 ...
##  $ visib         : num  10 10 10 10 10 10 10 10 10 9 ...
##  $ tzone         : Factor w/ 7 levels "America/Anchorage",..: 3 3 3 3 3 5 5 5 5 5 ...
##  $ w_day         : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tue"<..: 2 7 7 7 1 3 5 1 1 3 ...
##  $ week_num      : num  20 38 41 42 44 29 35 37 31 27 ...
##  - attr(*, ".internal.selfref")=<externalptr>

Combine ‘manufacturer’ and ‘model’ Variables

We wanted to check whether the variable ‘model’ is a unique variable. That is, there are no model names that appear under two different manufacturers. Our examination revealed a small number of models whose names are not unique and appear under two different manufacturers. To overcome this problem, we decided to combine ‘manufacturer’ and ‘model’ columns into one column named ‘manu_model’ with the aim of reducing the amount of variables in the dataset. The manu-model column contains unique names for each model because each model is coupled to its manufacturer’s name.

Later we checked whether this variable has priority over the manufacturer variable, and we saw that the variable is equivalent and, in some cases, even takes on higher importance in the models we run. Therefore, we will delete the ‘model’ and ‘manufacturer’ columns and leave the ‘manu-model’ column containing both information.

Show in how many different manufacturers each model presented:

manufacturer_model_df<-flights_full_arranged %>% group_by(manufacturer, model) %>% summarise(counts=n())
table_manufacturer_model<-table(manufacturer_model_df$model)
table_manufacturer_model[table_manufacturer_model>1] # models' prevalence in different manufacturers
## 
##    A319-112    A319-114    A319-131    A319-132    A320-211    A320-212 
##           2           2           2           2           2           2 
##    A320-214    A320-232    A321-211    A321-231    A330-223 CL-600-2B19 
##           2           2           2           2           2           2 
##   FALCON-XP   FALCON XP       MD-88    MD-90-30 
##           3           5           2           2

Combine manufacturer and model columns into one column:

manu_model <-
  paste(flights_full_arranged$manufacturer,
        flights_full_arranged$model,
        sep = "_")
flights_full_arranged$manu_model <- manu_model

Plots of normalized delayed flights counts per variable’s categories

In order to visually check whether a variable demonstrates variability in delayed flights (in cases when dep_delay=1), we created a bar plot for each variable separately while considering the number of flights in each category. Thus, for each category in a given variable we calculated the relative amount of delayed flights out of total flights in the current category. Due to this calculation, we can compare the categories.

flights_full_arranged_factors_all <-
  data.frame(lapply(flights_full_arranged, factor))
str(flights_full_arranged_factors_all)
## 'data.frame':    262613 obs. of  26 variables:
##  $ dest          : Factor w/ 100 levels "ABQ","ACK","ALB",..: 1 1 1 1 1 2 2 2 2 2 ...
##  $ origin        : Factor w/ 3 levels "EWR","JFK","LGA": 2 2 2 2 2 2 2 2 2 2 ...
##  $ month.x       : Ord.factor w/ 12 levels "1"<"2"<"3"<"4"<..: 5 9 10 10 11 7 9 9 8 7 ...
##  $ sched_dep_time: Factor w/ 1011 levels "300","301","305",..: 871 871 869 869 870 399 389 389 150 150 ...
##  $ dep_delay     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sched_arr_time: Factor w/ 1102 levels "1","2","3","4",..: 1051 1031 1029 1029 1046 459 442 442 212 212 ...
##  $ carrier       : Factor w/ 16 levels "9E","AA","AS",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ distance      : Factor w/ 205 levels "80","94","96",..: 179 179 179 179 179 14 14 14 14 14 ...
##  $ year.y        : Factor w/ 46 levels "1956","1959",..: 37 37 39 37 44 38 44 44 38 39 ...
##  $ type          : Factor w/ 3 levels "Fixed wing multi engine",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ manufacturer  : Factor w/ 35 levels "AGUSTA SPA","AIRBUS",..: 2 2 2 2 2 18 18 18 18 18 ...
##  $ model         : Factor w/ 127 levels "150","172E","172M",..: 89 89 89 89 89 109 109 109 109 109 ...
##  $ engines       : Factor w/ 4 levels "1","2","3","4": 2 2 2 2 2 2 2 2 2 2 ...
##  $ seats         : Factor w/ 48 levels "2","4","5","6",..: 34 34 34 34 34 13 13 13 13 13 ...
##  $ engine        : Factor w/ 6 levels "4 Cycle","Reciprocating",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp          : Factor w/ 167 levels "10.94","12.02",..: 101 112 105 85 82 143 132 122 123 132 ...
##  $ humid         : Factor w/ 2431 levels "12.74","13","13.95",..: 1807 1822 1938 1152 672 1808 1141 923 1179 1988 ...
##  $ wind_dir      : Factor w/ 16 levels "N","NNE","NE",..: 9 4 9 11 13 11 15 14 2 10 ...
##  $ wind_speed    : Factor w/ 34 levels "0","3.45234",..: 5 2 9 17 15 10 11 13 6 7 ...
##  $ precip        : Factor w/ 55 levels "0","0.01","0.02",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ pressure      : Factor w/ 453 levels "983.8","985",..: 258 296 141 193 215 85 203 127 252 237 ...
##  $ visib         : Factor w/ 20 levels "0","0.06","0.12",..: 20 20 20 20 20 20 20 20 20 19 ...
##  $ tzone         : Factor w/ 7 levels "America/Anchorage",..: 3 3 3 3 3 5 5 5 5 5 ...
##  $ w_day         : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tue"<..: 2 7 7 7 1 3 5 1 1 3 ...
##  $ week_num      : Factor w/ 52 levels "0","1","2","3",..: 21 39 42 43 45 30 36 38 32 28 ...
##  $ manu_model    : Factor w/ 147 levels "AGUSTA SPA_A109E",..: 23 23 23 23 23 122 122 122 122 122 ...
gplot_var <- function(var_name, convert_x, palette_a = FALSE) {
  count_df_var <-
    flights_full_arranged_factors_all %>% group_by(!!(sym(var_name))) %>% summarize(total_count =
                                                                                      n())
  df_var <-
    flights_full_arranged_factors_all %>% group_by(!!(sym(var_name)), dep_delay) %>% tally() %>% filter(dep_delay ==
                                                                                                          1)
  df_var <- merge(df_var, count_df_var, by = var_name)
  df_var <- df_var %>% mutate(relative_count_1 = n / total_count)
  varX <- paste(var_name, ".x", sep = "")
  colnames(df_var)[which(colnames(df_var) == varX)] <- var_name
  g <-
    ggplot(
      data = df_var,
      mapping = aes_string(x = var_name, y = "relative_count_1", fill = var_name)
    ) + geom_bar(stat = "identity", show.legend = FALSE) +
    labs(
      fill = var_name,
      x = var_name,
      y = 'Normalized count delayed flights',
      title = paste("Normalized count delayed flights per category in", var_name)
    )
  if (convert_x == TRUE) {
    t <- theme(
      plot.title = element_text(
        hjust = 0.5,
        size = 19,
        face = "bold"
      ),
      axis.text.x = element_text(
        angle = 90,
        vjust = 0.5,
        hjust = 1,
        size = 6
      )
    )
  } else {
    t <- theme(
      plot.title = element_text(
        hjust = 0.5,
        size = 19,
        face = "bold"
      ),
      axis.text.x = element_text(size = 6)
    )
  }
  if (!(palette_a == FALSE)) {
    color <- scale_fill_brewer(palette = palette_a)
    g + t + color
  } else {
    g + t
  }
}

Destination

# plot normalized delayed flights in destination variable
dest_plot<-gplot_var("dest", convert_x=TRUE)                                                                        
dest_plot

Origin Airport

# plot normalized delayed flights in origin variable
origin_plot<-gplot_var("origin", convert_x=FALSE, palette_a="PuRd")                                                                    
origin_plot

Month

# plot normalized delayed flights in month variable
colnames(flights_full_arranged_factors_all)[which(colnames(flights_full_arranged_factors_all)=="month.x")]<-"month"
month_plot<-gplot_var("month", convert_x=FALSE)                                                                        
month_plot

Carrier (airline)

# plot normalized delayed flights in carrier variable
carrier_plot<-gplot_var("carrier", convert_x=FALSE)                                                                        
carrier_plot

Year of Manufacturer

# plot normalized delayed flights in year variable
colnames(flights_full_arranged_factors_all)[which(colnames(flights_full_arranged_factors_all)=="year.y")]<-"year"
year_plot<-gplot_var("year", convert_x=TRUE)                                                                        
year_plot

Engine Type

# plot normalized delayed flights in type variable
type_plot<-gplot_var("type", convert_x=FALSE, palette_a="Oranges")                                                                    
type_plot

Manufacture

# plot normalized delayed flights in manufacturer variable
manufacturer_plot<-gplot_var("manufacturer", convert_x=TRUE)                                                                        
manufacturer_plot

Model

# plot normalized delayed flights in model variable
model_plot<-gplot_var("model", convert_x=TRUE)                                                                        
model_plot

Number of Engines

# plot normalized delayed flights in engines variable
engines_plot<-gplot_var("engines", convert_x=FALSE, palette_a="BuGn")                                                                  
engines_plot

Visibility

# plot normalized delayed flights in visib variable
visib_plot<-gplot_var("visib", convert_x=FALSE)                                                                        
visib_plot

Time Zone

# plot normalized delayed flights in tzone variable
tzone_plot<-gplot_var("tzone", convert_x=TRUE, palette_a="RdPu")                                                                    
tzone_plot

Week Day

# plot normalized delayed flights in w_day variable
w_day_plot<-gplot_var("w_day", convert_x=FALSE, palette_a="YlGn")                                                                     
w_day_plot

Week Number

# plot normalized delayed flights in week_num variable
week_num_plot<-gplot_var("week_num", convert_x=FALSE)                                                                        
week_num_plot

Manu- Model

# plot normalized delayed flights in manu_model variable
manu_model_plot<-gplot_var("manu_model", convert_x=TRUE)                                                                        
manu_model_plot

Plot Multiple Variables

We have seen that overall there is not one particular variable that has an apparent direct relationship to the departure delay. The following plot aims to examine the option of multiple variables together having a more apparent relation to the departure delay.

wind_dir_df<- flights_full_arranged %>% group_by(origin, wind_dir) %>% summarize(total_count=n(), mean_speed = mean(wind_speed))
df_wind<-flights_full_arranged %>% group_by(wind_dir, dep_delay) %>% tally() %>% filter(dep_delay==1, !is.na(wind_dir)) 
df_wind<-merge(df_wind, wind_dir_df, by= 'wind_dir')
df_wind <- df_wind %>% mutate(relative_count_1=n/total_count)


ggplot(df_wind, aes(x = wind_dir, y = relative_count_1, fill = mean_speed)) +
  geom_col(binwidth = 15,
           boundary = -7.5,
           position = 'stack') +
  coord_polar(theta = "x", direction = 2) +
  facet_wrap( ~ origin, nrow = 1) +
  labs(x = "Wind Direction",
       y = 'normalized delay (num delay/total flights)',
       title = "Departure Delay and Wind Speed per Wind Direction and Origin",
       fill = 'Ave. Wind Speed [mph]') +
  theme(plot.title = element_text(face = "bold.italic", hjust = 0.5)) +
  scale_fill_distiller(palette = 'YlOrRd', direction = 1)

Categories Reduction in Factorial Variables using Permutation Test

In order to optimize our model, we decreased the number of categories (levels) in factorial variables that are identified with a large number of categories. This step was performed using a statistical permutation test on each variable separately. Permutation tests are non-parametric and only require the assumption of exchangeability, regardless of whether or not the distribution of the data is known. Our assumption is that in a particular variable, a category that belongs to a delayed flight, not by chance, should be considered individually in our model since it can indicate the possibility of delay of a future flight related to this category. However, a category that repeats on random realizations with the same or even larger prevalence of delayed flights will probably fail to reject the null hypothesis that delayed flights are not over-represented in the current category in the original dataset. Thus, the insignificant categories of a variable should be combined into one aggregated category.

We performed 2000 realizations of the original dataset: During each realization, we randomly shuffled the departure time delay (‘dep_delay’) column and counted how many flights labeled as delayed (with dep_delay=1) for each category in each variable separately. Next, we compared it to the original count of departure delayed flights in the particular category. The P-value for a given category was calculated as the number of realizations when the count of departure delayed flights was equal to or even higher than the original delayed flights count in this category (plus one), divided by the total number of realizations. For example, a permutation analysis p-value of 1e-3 (in case of 1000 total realizations) means that the null hypothesis was rejected in all the realizations, suggesting that the specific category is significant in distinguishing delayed flights in the original dataset, using a confidence of 0.005. Finally, the permutation analysis p-value was corrected for multiple testing using the Benjamini-Hochberg procedure.

The P-value for a given category was calculated as the number of realizations when the count of departure delayed flights was equal to or higher than the original delayed flights count in this category (plus one), divided by the total number of realizations. For example, a permutation analysis p-value of 1e-3 (in case of 1000 total realizations) means that the null hypothesis was rejected in all the realizations, suggesting that the specific category is significant in distinguishing delayed flights in the original dataset, using a confidence of 0.005. Finally, the permutation analysis p-value was corrected for multiple testing using the Benjamini-Hochberg procedure.

We applied this step to several variables: * “model” variable, which included 127 categories, and after the permutation test had only 13 categories. * “manufacturer” variable, which initially had 35 categories and finally had 4. * “manu_model” included 147 categories at the beginning and 12 at the end of the permutation test. * “seats” included 48 categories at the beginning and 8 at the end of the permutation test. * “dest” variable included at first 100 categories, and after the permutation test had only 52 categories.

We provide in this project the complete output data frames observed from the permutation analysis. There are 5 data frames - each variable has its own data frame that includes the number of realizations (plus one) in which delayed flight counts were equal to or higher than the original counts for each category. We visualized the results for each variable in bar plots representing p-value adjusted values (after Benjamini-Hochberg correction) of each category. A line representing FDR=0.05 is presented on the plot as well in order to emphasize cases (categories) when the null hypothesis was rejected. In addition, we generated plots of total flights in each category to compare p-value results to the total amount of flights in each category.

# permutation function
levels_model <-
  levels(factor(flights_full_arranged$model)) #levels of model
levels_manufacturer <-
  levels(factor(flights_full_arranged$manufacturer)) #levels of manufacturer
levels_manu_model <-
  levels(factor(flights_full_arranged$manu_model)) #levels of manu_model
levels_dest <-
  levels(factor(flights_full_arranged$dest)) #levels of destinations
levels_seats <-
  levels(factor(flights_full_arranged$seats)) #levels of seats


run_permutations = FALSE

if (run_permutations) {
  origin_num_delay_var <- function(var_name, levels_var) {
    origin_delay_vec <-
      sapply(
        levels_var,
        simplify = TRUE,
        FUN = function(one_level) {
          df_summarize <-
            flights_full_arranged %>% filter((!!sym(var_name)) == one_level) %>% group_by(dep_delay, .drop =
                                                                                            FALSE) %>% tally
          number_delay_in_level <-
            df_summarize$n[df_summarize$dep_delay == 1]
          number_delay_in_level
        }
      )
    origin_delay_vec
  }
  
  origin_num_delay_model <-
    origin_num_delay_var('model', levels_model)
  origin_num_delay_manufacturer <-
    origin_num_delay_var('manufacturer', levels_manufacturer)
  origin_num_delay_manu_model <-
    origin_num_delay_var('manu_model', levels_manu_model)
  origin_num_delay_dest <- origin_num_delay_var('dest', levels_dest)
  origin_num_delay_seats <-
    origin_num_delay_var('seats', levels_seats)
  
  
  perm_vec_var <- function(levels_var_model,
                           origin_num_delay_var) {
    perm_vec <- rep(1, length(levels_var_model))
    perm_vec <- setNames(perm_vec, names(origin_num_delay_var))
  }
  
  perm_ndelay_vec_model <-
    perm_vec_var(levels_model, origin_num_delay_model)
  perm_ndelay_vec_manufacturer <-
    perm_vec_var(levels_manufacturer, origin_num_delay_manufacturer)
  perm_ndelay_vec_manu_model <-
    perm_vec_var(levels_manu_model, origin_num_delay_manu_model)
  perm_ndelay_vec_dest <-
    perm_vec_var(levels_dest, origin_num_delay_dest)
  perm_ndelay_vec_seats <-
    perm_vec_var(levels_seats, origin_num_delay_seats)
  
  create_perm_ndelay_vec_var <-
    function(flights_full_perm,
             var_name,
             levels_var,
             origin_num_delay_var,
             perm_vec_var) {
      perm_num_delay_var <-
        sapply(
          levels_var,
          simplify = TRUE,
          FUN = function(one_level) {
            df_summarize <-
              flights_full_perm %>% filter((!!sym(var_name)) == one_level) %>% group_by(dep_delay, .drop =
                                                                                          FALSE) %>% tally
            number_delay_in_level <-
              df_summarize$n[df_summarize$dep_delay == 1]
            number_delay_in_level
          }
        )
      ind_greater <- which(perm_num_delay_var >= origin_num_delay_var)
      perm_vec_var[ind_greater] <- perm_vec_var[ind_greater] + 1
      perm_vec_var
    }
  
  num_perm <- 2000
  for (iter in 1:num_perm) {
    print(iter)
    flights_full_perm <-
      transform(flights_full_arranged, dep_delay = sample(dep_delay))
    perm_ndelay_vec_model <-
      create_perm_ndelay_vec_var(
        flights_full_perm,
        'model',
        levels_model,
        origin_num_delay_model,
        perm_ndelay_vec_model
      )
    perm_ndelay_vec_manufacturer <-
      create_perm_ndelay_vec_var(
        flights_full_perm,
        'manufacturer',
        levels_manufacturer,
        origin_num_delay_manufacturer,
        perm_ndelay_vec_manufacturer
      )
    perm_ndelay_vec_manu_model <-
      create_perm_ndelay_vec_var(
        flights_full_perm,
        'manu_model',
        levels_manu_model,
        origin_num_delay_manu_model,
        perm_ndelay_vec_manu_model
      )
    perm_ndelay_vec_dest <-
      create_perm_ndelay_vec_var(
        flights_full_perm,
        'dest',
        levels_dest,
        origin_num_delay_dest,
        perm_ndelay_vec_dest
      )
    perm_ndelay_vec_seats <-
      create_perm_ndelay_vec_var(
        flights_full_perm,
        'seats',
        levels_seats,
        origin_num_delay_seats,
        perm_ndelay_vec_seats
      )
  }
  
  #save perm_ndelay_vec_var output from permutations to csv (as tibble data frmae)
  perm_ndelay_vec_model_df = tibble(name = names(perm_ndelay_vec_model), value = perm_ndelay_vec_model)
  write.table(
    perm_ndelay_vec_model_df ,
    file = "../output/perm_ndelay_vec_model_df.csv",
    sep = ",",
    row.names = FALSE
  )
  perm_ndelay_vec_manufacturer_df = tibble(name = names(perm_ndelay_vec_manufacturer),
                                           value = perm_ndelay_vec_manufacturer)
  write.table(
    perm_ndelay_vec_manufacturer_df ,
    file = "../output/perm_ndelay_vec_manufacturer_df.csv",
    sep = ",",
    row.names = FALSE
  )
  perm_ndelay_vec_manu_model_df = tibble(name = names(perm_ndelay_vec_manu_model),
                                         value = perm_ndelay_vec_manu_model)
  write.table(
    perm_ndelay_vec_manu_model_df ,
    file = "../output/perm_ndelay_vec_manu_model_df.csv",
    sep = ",",
    row.names = FALSE
  )
  perm_ndelay_vec_dest_df = tibble(name = names(perm_ndelay_vec_dest), value = perm_ndelay_vec_dest)
  write.table(
    perm_ndelay_vec_dest_df ,
    file = "../output/perm_ndelay_vec_dest_df.csv",
    sep = ",",
    row.names = FALSE
  )
  perm_ndelay_vec_seats_df = tibble(name = names(perm_ndelay_vec_seats), value = perm_ndelay_vec_seats)
  write.table(
    perm_ndelay_vec_seats_df ,
    file = "../output/perm_ndelay_vec_seats_df.csv",
    sep = ",",
    row.names = FALSE
  )
  
  
}

Load output data frames of permutation test and display plots of p-val and of flights count for each level

num_perm <- 2000
perm_result_var_df <- function(var_name) {
  perm_ndelay_var_df <-
    read.table(
      file = paste(
        "../output/perm_ndelay_merged_vec_",
        var_name,
        "_df.csv",
        sep = ""
      ),
      sep = ",",
      header = TRUE
    )
  colnames(perm_ndelay_var_df)[1] <- var_name
  perm_ndelay_var_df$p_val <- perm_ndelay_var_df$value / num_perm
  perm_ndelay_var_df$p_adj <-
    p.adjust(perm_ndelay_var_df$p_val, method = "BH")
  
  flights_counts_var_df <-
    flights_full_arranged %>% group_by((!!sym(var_name))) %>% summarise(total_counts =
                                                                          n())
  perm_ndelay_var_df <-
    merge(perm_ndelay_var_df, flights_counts_var_df, by = var_name) #add flights counts per level
  
  perm_ndelay_var_df <- perm_ndelay_var_df %>% arrange(p_adj)
  
  p_val_plot <-
    ggplot(perm_ndelay_var_df, aes(x = reorder((!!sym(
      var_name
    )), p_adj) , y = p_adj)) + geom_bar(stat = "identity") +
    labs(
      title = paste("p-value adjusted per", var_name, "level"),
      x = var_name,
      y = "p-value adjusted"
    ) +
    theme(
      plot.title = element_text(
        hjust = 0.5,
        size = 19,
        face = "bold"
      ),
      axis.text.x = element_text(
        angle = 90,
        vjust = 0.5,
        hjust = 1,
        size = 6
      )
    ) +
    geom_hline(aes(yintercept = 0.05, color = "0.05"),
               linetype = "dashed",
               size = 1.5) +
    scale_color_manual(name = "FDR cutoff", values = c("0.05" = "red"))
  
  flights_counts_plot <-
    ggplot(perm_ndelay_var_df,  aes(x = reorder((!!sym(
      var_name
    )), p_adj) , y = total_counts)) +
    geom_bar(stat = "identity") + labs(
      title = paste("flights counts per", var_name, "level"),
      x = var_name,
      y = "flights counts"
    ) +
    theme(
      plot.title = element_text(
        hjust = 0.5,
        size = 19,
        face = "bold"
      ),
      axis.text.x = element_text(
        angle = 90,
        vjust = 0.5,
        hjust = 1,
        size = 6
      )
    )
  
  return(list(perm_ndelay_var_df, p_val_plot, flights_counts_plot))
}

Permutation - model

perm_model_df <- perm_result_var_df("model")
perm_ndelay_model_df <- perm_model_df[[1]]
perm_model_df[[2]]

perm_model_df[[3]]

Permutation - manufacturer

perm_manufacturer_df <- perm_result_var_df("manufacturer")
perm_ndelay_manufacturer_df <- perm_manufacturer_df[[1]]
perm_manufacturer_df[[2]]

perm_manufacturer_df[[3]]

Permutation - manu_model

perm_manu_model_df <- perm_result_var_df("manu_model")
perm_ndelay_manu_model_df <- perm_manu_model_df[[1]]
perm_manu_model_df[[2]]

perm_manu_model_df[[3]]

Permutation - dest

perm_dest_df <- perm_result_var_df("dest")
perm_ndelay_dest_df <- perm_dest_df[[1]]
perm_dest_df[[2]]

perm_dest_df[[3]]

Permutation - seats

perm_seats_df <- perm_result_var_df("seats")
perm_ndelay_seats_df <- perm_seats_df[[1]]
perm_seats_df[[2]]

perm_seats_df[[3]]

Following the permutation test and its visualization, in each variable column (5 variables), we replaced the original levels (categories) with new levels. Those who passed the Benjamini-Hochberg multiple test correction (FDR<0.05) were kept, while the insignificant categories were converted to a single new category named ‘other’. Categories in ‘model’, ‘manufacturer’, and ‘manu_model’ variables were changed to one-letter levels, as well as ‘dest’ variable, whose categories were replaced with numbers for visualization improvement purposes.

Change model levels in flights_full_arranged

flights_full_arranged$model <-
  as.character(flights_full_arranged$model)
models_to_change <-
  perm_ndelay_model_df$model[which(perm_ndelay_model_df$p_adj <= 0.05)] #filtered levels (were kept)
models_to_change
##  [1] "717-200"     "737-7H4"     "757-324"     "CL-600-2B19" "CL-600-2C10"
##  [6] "CL-600-2D24" "EMB-145"     "EMB-145LR"   "EMB-145XR"   "A319-115"   
## [11] "777-224"     "737-7BD"
flights_full_arranged <-
  flights_full_arranged %>% mutate(model = replace(
    model,!(flights_full_arranged$model %in% models_to_change),
    'other'
  ))

flights_full_arranged$model <-
  as.factor(flights_full_arranged$model) #convert back to factor variable
levels(flights_full_arranged$model)
##  [1] "717-200"     "737-7BD"     "737-7H4"     "757-324"     "777-224"    
##  [6] "A319-115"    "CL-600-2B19" "CL-600-2C10" "CL-600-2D24" "EMB-145"    
## [11] "EMB-145LR"   "EMB-145XR"   "other"

Change model levels into one-letter categories

model_levels_after_change <-
  data.frame(model = levels(flights_full_arranged$model))
model_levels_after_change$new_name <-
  letters[1:nrow(model_levels_after_change)]
model_levels_after_change$new_name[model_levels_after_change$model == "other"] <-
  "other"
model_letter_cat <- lapply(flights_full_arranged$model, function(row) {
  y <-
    model_levels_after_change$new_name[which(row == model_levels_after_change$model)]
  y
})
flights_full_arranged$model <- as.factor(unlist(model_letter_cat))
levels(flights_full_arranged$model)
##  [1] "a"     "b"     "c"     "d"     "e"     "f"     "g"     "h"     "i"    
## [10] "j"     "k"     "l"     "other"

Change manufacturer levels in flights_full_arranged

flights_full_arranged$manufacturer <-
  as.character(flights_full_arranged$manufacturer)
manufacturers_to_change <-
  perm_ndelay_manufacturer_df$manufacturer[which(perm_ndelay_manufacturer_df$p_adj <=
                                                   0.05)] #filtered levels (were kept)
manufacturers_to_change
## [1] "BOMBARDIER INC" "CANADAIR"       "EMBRAER"
flights_full_arranged <-
  flights_full_arranged %>% mutate(manufacturer = replace(
    manufacturer,!(
      flights_full_arranged$manufacturer %in% manufacturers_to_change
    ),
    'other'
  ))

flights_full_arranged$manufacturer <-
  as.factor(flights_full_arranged$manufacturer) #convert back to factor variable
levels(flights_full_arranged$manufacturer)
## [1] "BOMBARDIER INC" "CANADAIR"       "EMBRAER"        "other"

Change manufacturer levels into one-letter categories

manufacturer_levels_after_change <-
  data.frame(manufacturer = levels(flights_full_arranged$manufacturer))
manufacturer_levels_after_change$new_name <-
  letters[1:nrow(manufacturer_levels_after_change)]
manufacturer_levels_after_change$new_name[manufacturer_levels_after_change$manufacturer == "other"] <-
  "other"
manufacturer_letter_cat <- lapply(flights_full_arranged$manufacturer, function(row) {
  y <-
    manufacturer_levels_after_change$new_name[which(row == manufacturer_levels_after_change$manufacturer)]
  y
})
flights_full_arranged$manufacturer <- as.factor(unlist(manufacturer_letter_cat))
levels(flights_full_arranged$manufacturer)
## [1] "a"     "b"     "c"     "other"

Change manu_model levels in flights_full_arranged

flights_full_arranged$manu_model <-
  as.character(flights_full_arranged$manu_model)
manu_models_to_change <-
  perm_ndelay_manu_model_df$manu_model[which(perm_ndelay_manu_model_df$p_adj <=
                                               0.05)] #filtered levels (were kept)
manu_models_to_change
##  [1] "BOEING_717-200"             "BOEING_737-7H4"            
##  [3] "BOEING_757-324"             "BOMBARDIER INC_CL-600-2B19"
##  [5] "BOMBARDIER INC_CL-600-2C10" "BOMBARDIER INC_CL-600-2D24"
##  [7] "CANADAIR_CL-600-2B19"       "EMBRAER_EMB-145"           
##  [9] "EMBRAER_EMB-145LR"          "EMBRAER_EMB-145XR"         
## [11] "AIRBUS_A319-115"            "BOEING_777-224"
flights_full_arranged <-
  
  flights_full_arranged %>% mutate(manu_model = replace(
    manu_model,!(flights_full_arranged$manu_model %in% manu_models_to_change),
    'other'
  ))

flights_full_arranged$manu_model <-
  as.factor(flights_full_arranged$manu_model) #convert back to factor variable
levels(flights_full_arranged$manu_model)
##  [1] "AIRBUS_A319-115"            "BOEING_717-200"            
##  [3] "BOEING_737-7H4"             "BOEING_757-324"            
##  [5] "BOEING_777-224"             "BOMBARDIER INC_CL-600-2B19"
##  [7] "BOMBARDIER INC_CL-600-2C10" "BOMBARDIER INC_CL-600-2D24"
##  [9] "CANADAIR_CL-600-2B19"       "EMBRAER_EMB-145"           
## [11] "EMBRAER_EMB-145LR"          "EMBRAER_EMB-145XR"         
## [13] "other"

Change manu_model column into one-letter categories

flights_full_arranged_bf_change <- flights_full_arranged
manu_model_levels_after_change <-
  data.frame(manu_model = levels(flights_full_arranged$manu_model))
manu_model_levels_after_change$new_name <-
  letters[1:nrow(manu_model_levels_after_change)]
manu_model_levels_after_change$new_name[manu_model_levels_after_change$manu_model ==
                                          "other"] <-
  "other"
manu_model_letter_cat <-
  lapply(flights_full_arranged$manu_model, function(row) {
    y <-
      manu_model_levels_after_change$new_name[which(row == manu_model_levels_after_change$manu_model)]
    y
  })
flights_full_arranged$manu_model <-
  as.factor(unlist(manu_model_letter_cat))
levels(flights_full_arranged$manu_model)
##  [1] "a"     "b"     "c"     "d"     "e"     "f"     "g"     "h"     "i"    
## [10] "j"     "k"     "l"     "other"

Change destination levels in flights_full_arranged

flights_full_arranged$dest <-
  as.character(flights_full_arranged$dest)
dests_to_change <-
  perm_ndelay_dest_df$dest[which(perm_ndelay_dest_df$p_adj <= 0.05)] #filtered levels (were kept)
dests_to_change
##  [1] "ALB" "BGR" "BHM" "BNA" "BTV" "BWI" "CAE" "CAK" "CLE" "CVG" "DAY" "DSM"
## [13] "GRR" "GSO" "GSP" "IAD" "ILM" "JAX" "MCI" "MDW" "MEM" "MHT" "MKE" "MSN"
## [25] "OKC" "OMA" "ORF" "PVD" "PWM" "RDU" "RIC" "ROC" "SAT" "SDF" "STL" "TUL"
## [37] "TYS" "BDL" "CHS" "SMF" "SAV" "PDX" "JAC" "SYR" "TVC" "ABQ" "PIT" "CHO"
## [49] "IND" "DEN" "BUF"
flights_full_arranged <-
  flights_full_arranged %>% mutate(dest = replace(
    dest,!(flights_full_arranged$dest %in% dests_to_change),
    'other'
  ))

flights_full_arranged$dest <-
  as.factor(flights_full_arranged$dest) #convert back to factor variable
levels(flights_full_arranged$dest)
##  [1] "ABQ"   "ALB"   "BDL"   "BGR"   "BHM"   "BNA"   "BTV"   "BUF"   "BWI"  
## [10] "CAE"   "CAK"   "CHO"   "CHS"   "CLE"   "CVG"   "DAY"   "DEN"   "DSM"  
## [19] "GRR"   "GSO"   "GSP"   "IAD"   "ILM"   "IND"   "JAC"   "JAX"   "MCI"  
## [28] "MDW"   "MEM"   "MHT"   "MKE"   "MSN"   "OKC"   "OMA"   "ORF"   "other"
## [37] "PDX"   "PIT"   "PVD"   "PWM"   "RDU"   "RIC"   "ROC"   "SAT"   "SAV"  
## [46] "SDF"   "SMF"   "STL"   "SYR"   "TUL"   "TVC"   "TYS"

Change destination column into number categories

dest_levels_after_change <-
  data.frame(dest = levels(flights_full_arranged$dest))
dest_levels_after_change$new_name <-
  1:nrow(dest_levels_after_change)
dest_number_cat <-
  lapply(flights_full_arranged$dest, function(row) {
    y <- which(row == dest_levels_after_change$dest)
    y
  })

flights_full_arranged$dest <- as.factor(unlist(dest_number_cat))
levels(flights_full_arranged$dest)
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15"
## [16] "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30"
## [31] "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42" "43" "44" "45"
## [46] "46" "47" "48" "49" "50" "51" "52"

Change seats levels in flights_full_arranged

flights_full_arranged$seats <-
  as.character(flights_full_arranged$seats)
seats_to_change <-
  perm_ndelay_seats_df$seats[which(perm_ndelay_seats_df$p_adj <= 0.05)] #filtered levels (were kept)
seats_to_change
## [1]  55  80  95 100 140 147 275
flights_full_arranged <-
  flights_full_arranged %>% mutate(seats = replace(
    seats,!(flights_full_arranged$seats %in% seats_to_change),
    'other'
  ))

flights_full_arranged$seats <-
  as.factor(flights_full_arranged$seats) #convert back to factor variable
levels(flights_full_arranged$seats)
## [1] "100"   "140"   "147"   "275"   "55"    "80"    "95"    "other"

Sensitivity Check

In order to check the sensitivity of our prediction models to the threshold we chose to distinguish between delayed and non-delayed flights, we will generate two additional datasets, one with a higher threshold - 25 minutes, and one with a lower threshold - 15 minutes. Some of the pre-processing depends on the labels assigned to each flight (0- non-delay, 1- delay). Therefore, we will perform the necessary pre-processing steps to ensure that the data is coherent.

Threshold = 15 minutes

#new data frame for sensitivity check, threshold = 15
flights_full_15 <- flights_full

#divide the flights into 2 groups according to their dep_delay - flights above 15 min delay, and flights above -10 & until 15 min delay

flights_full_new_dep_delay_15 <-
  flights_full_15[which(flights_full_15$dep_delay > -10), ]
flights_full_arranged_15 <-
  flights_full_new_dep_delay_15 %>% arrange(dep_delay)


# plot histogram of original dep_delay before changing it into 2 categories with the threshold
ggplot(flights_full_arranged_15, aes(x = dep_delay)) +
  geom_histogram(color = "black", fill = "white", bins = 40) +
  geom_vline(aes(xintercept = 15, color = "delay > 15 min"),
             linetype = "dashed",
             size = 1.3) +
  scale_color_manual(name = "Treshold delay time", values = c("delay > 15 min" = "red")) +
  
  labs(
    x = "Departure delay time [min]",
    y = "counts of flights",
    title = paste('Histogram of departure delay time')
  ) +
  theme(plot.title = element_text(hjust = 0.5, size = 19, face = "bold"))

# percentage of dep_delay=0
(length(which(
  flights_full_arranged_15$dep_delay < (15)
))) / nrow(flights_full_arranged_15)
## [1] 0.7662873
# percentage of dep_delay=1
(length(which(
  flights_full_arranged_15$dep_delay >= (15)
))) / nrow(flights_full_arranged_15)
## [1] 0.2337127
#change dep_delay column into categories (0 / 1)
flights_full_arranged_15 <-
  
  flights_full_arranged_15 %>% mutate(dep_delay = case_when(dep_delay < 15 ~ 0,
                                                            dep_delay >= 15 ~ 1))


#convert dep_delay to factor column
flights_full_arranged_15$dep_delay <-
  as.factor(flights_full_arranged_15$dep_delay)


#plot flights counts per 2 dep_delay categories
ggplot(flights_full_arranged_15, aes(dep_delay, fill = dep_delay)) +  geom_bar(stat="count") + 
  scale_fill_manual(labels=c("0 = no delay", "1 = delay"), values = c('#CC6666', '#FFCCCC')) +
  labs(title = "Flights counts per departure delay category (15)", x = "Departure delay categories") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

# merge manufacturer and model columns into one column
manu_model <-
  paste(flights_full_arranged_15$manufacturer,
        flights_full_arranged_15$model,
        sep = "_")
flights_full_arranged_15$manu_model <- manu_model


# permutation function - skip it because it is already done
levels_model <-
  levels(factor(flights_full_arranged_15$model)) #levels of model
levels_manufacturer <-
  levels(factor(flights_full_arranged_15$manufacturer)) #levels of manufacturer
levels_manu_model <-
  levels(factor(flights_full_arranged_15$manu_model)) #levels of manu_model
levels_dest <-
  levels(factor(flights_full_arranged_15$dest)) #levels of destinations
levels_seats <-
  levels(factor(flights_full_arranged_15$seats)) #levels of seats

run_permutations = FALSE

if (run_permutations) {
  origin_num_delay_var <- function(var_name, levels_var) {
    origin_delay_vec <-
      sapply(
        levels_var,
        simplify = TRUE,
        FUN = function(one_level) {
          df_summarize <-
            flights_full_arranged_15 %>% filter((!!sym(var_name)) == one_level) %>% group_by(dep_delay, .drop =
                                                                                               FALSE) %>% tally
          number_delay_in_level <-
            df_summarize$n[df_summarize$dep_delay == 1]
          number_delay_in_level
        }
      )
    origin_delay_vec
  }
  
  origin_num_delay_model <-
    origin_num_delay_var('model', levels_model)
  origin_num_delay_manufacturer <-
    origin_num_delay_var('manufacturer', levels_manufacturer)
  origin_num_delay_manu_model <-
    origin_num_delay_var('manu_model', levels_manu_model)
  origin_num_delay_dest <- origin_num_delay_var('dest', levels_dest)
  origin_num_delay_seats <-
    origin_num_delay_var('seats', levels_seats)
  
  
  perm_vec_var <- function(levels_var_model,
                           origin_num_delay_var) {
    perm_vec <- rep(1, length(levels_var_model))
    perm_vec <- setNames(perm_vec, names(origin_num_delay_var))
  }
  
  perm_ndelay_vec_model <-
    perm_vec_var(levels_model, origin_num_delay_model)
  perm_ndelay_vec_manufacturer <-
    perm_vec_var(levels_manufacturer, origin_num_delay_manufacturer)
  perm_ndelay_vec_manu_model <-
    perm_vec_var(levels_manu_model, origin_num_delay_manu_model)
  perm_ndelay_vec_dest <-
    perm_vec_var(levels_dest, origin_num_delay_dest)
  perm_ndelay_vec_seats <-
    perm_vec_var(levels_seats, origin_num_delay_seats)
  
  create_perm_ndelay_vec_var <-
    function(flights_full_perm,
             var_name,
             levels_var,
             origin_num_delay_var,
             perm_vec_var) {
      perm_num_delay_var <-
        sapply(
          levels_var,
          simplify = TRUE,
          FUN = function(one_level) {
            df_summarize <-
              flights_full_perm %>% filter((!!sym(var_name)) == one_level) %>% group_by(dep_delay, .drop =
                                                                                          FALSE) %>% tally
            number_delay_in_level <-
              df_summarize$n[df_summarize$dep_delay == 1]
            number_delay_in_level
          }
        )
      ind_greater <- which(perm_num_delay_var >= origin_num_delay_var)
      perm_vec_var[ind_greater] <- perm_vec_var[ind_greater] + 1
      perm_vec_var
    }
  
  num_perm <- 1000
  for (iter in 1:num_perm) {
    print(iter)
    flights_full_perm <-
      transform(flights_full_arranged_15, dep_delay = sample(dep_delay))
    perm_ndelay_vec_model <-
      create_perm_ndelay_vec_var(
        flights_full_perm,
        'model',
        levels_model,
        origin_num_delay_model,
        perm_ndelay_vec_model
      )
    perm_ndelay_vec_manufacturer <-
      create_perm_ndelay_vec_var(
        flights_full_perm,
        'manufacturer',
        levels_manufacturer,
        origin_num_delay_manufacturer,
        perm_ndelay_vec_manufacturer
      )
    perm_ndelay_vec_manu_model <-
      create_perm_ndelay_vec_var(
        flights_full_perm,
        'manu_model',
        levels_manu_model,
        origin_num_delay_manu_model,
        perm_ndelay_vec_manu_model
      )
    perm_ndelay_vec_dest <-
      create_perm_ndelay_vec_var(
        flights_full_perm,
        'dest',
        levels_dest,
        origin_num_delay_dest,
        perm_ndelay_vec_dest
      )
    perm_ndelay_vec_seats <-
      create_perm_ndelay_vec_var(
        flights_full_perm,
        'seats',
        levels_seats,
        origin_num_delay_seats,
        perm_ndelay_vec_seats
      )
  }
  
  #save perm_ndelay_vec_var output from permutations to csv (as tibble data frmae)
  perm_ndelay_vec_model_df = tibble(name = names(perm_ndelay_vec_model), value = perm_ndelay_vec_model)
  write.table(
    perm_ndelay_vec_model_df ,
    file = "../output/perm_ndelay_merged_vec_model_df_15.csv",
    sep = ",",
    row.names = FALSE
  )
  perm_ndelay_vec_manufacturer_df = tibble(name = names(perm_ndelay_vec_manufacturer),
                                           value = perm_ndelay_vec_manufacturer)
  write.table(
    perm_ndelay_vec_manufacturer_df ,
    file = "../output/perm_ndelay_merged_vec_manufacturer_df_15.csv",
    sep = ",",
    row.names = FALSE
  )
  perm_ndelay_vec_manu_model_df = tibble(name = names(perm_ndelay_vec_manu_model),
                                         value = perm_ndelay_vec_manu_model)
  write.table(
    perm_ndelay_vec_manu_model_df ,
    file = "../output/final_project/perm_ndelay_merged_vec_manu_model_df_15.csv",
    sep = ",",
    row.names = FALSE
  )
  perm_ndelay_vec_dest_df = tibble(name = names(perm_ndelay_vec_dest), value = perm_ndelay_vec_dest)
  write.table(
    perm_ndelay_vec_dest_df ,
    file = "../output/perm_ndelay_merged_vec_dest_df_15.csv",
    sep = ",",
    row.names = FALSE
  )
  perm_ndelay_vec_seats_df = tibble(name = names(perm_ndelay_vec_seats), value = perm_ndelay_vec_seats)
  write.table(
    perm_ndelay_vec_seats_df ,
    file = "../output/perm_ndelay_merged_vec_seats_df_15.csv",
    sep = ",",
    row.names = FALSE
  )

}


#display plots of p-val for each level and flights count for each level
num_perm <- 1000
perm_result_var_df <- function(var_name) {
  perm_ndelay_var_df <-
    read.table(
      file = paste(
        "../output/perm_ndelay_merged_vec_",
        var_name,
        "_df_15.csv",
        sep = ""
      ),
      sep = ",",
      header = TRUE
    )
  colnames(perm_ndelay_var_df)[1] <- var_name
  perm_ndelay_var_df$p_val <- perm_ndelay_var_df$value / num_perm
  perm_ndelay_var_df$p_adj <-
    p.adjust(perm_ndelay_var_df$p_val, method = "BH")
  
  flights_counts_var_df <-
    flights_full_arranged_15 %>% group_by((!!sym(var_name))) %>% summarise(total_counts =
                                                                             n())
  perm_ndelay_var_df <-
    merge(perm_ndelay_var_df, flights_counts_var_df, by = var_name) #add flights counts per level
  
  perm_ndelay_var_df <- perm_ndelay_var_df %>% arrange(p_adj)
  
  p_val_plot <-
    ggplot(perm_ndelay_var_df, aes(x = reorder((!!sym(
      var_name
    )), p_adj) , y = p_adj)) + geom_bar(stat = "identity") +
    labs(
      title = paste("p-value adjusted per", var_name, "level"),
      x = var_name,
      y = "p-value adjusted"
    ) +
    theme(
      plot.title = element_text(
        hjust = 0.5,
        size = 19,
        face = "bold"
      ),
      axis.text.x = element_text(
        angle = 90,
        vjust = 0.5,
        hjust = 1,
        size = 6
      )
    ) +
    geom_hline(aes(yintercept = 0.05, color = "0.05"),
               linetype = "dashed",
               size = 1.5) +
    scale_color_manual(name = "FDR cutoff", values = c("0.05" = "red"))
  
  flights_counts_plot <-
    ggplot(perm_ndelay_var_df,  aes(x = reorder((!!sym(
      var_name
    )), p_adj) , y = total_counts)) +
    geom_bar(stat = "identity") + labs(
      title = paste("flights counts per", var_name, "level"),
      x = var_name,
      y = "flights counts"
    ) +
    theme(
      plot.title = element_text(
        hjust = 0.5,
        size = 19,
        face = "bold"
      ),
      axis.text.x = element_text(
        angle = 90,
        vjust = 0.5,
        hjust = 1,
        size = 6
      )
    )
  
  return(list(perm_ndelay_var_df, p_val_plot, flights_counts_plot))
}

#permutation - model
perm_model_df <- perm_result_var_df("model")
perm_ndelay_model_df <- perm_model_df[[1]]



#permutation - manufacturer
perm_manufacturer_df <- perm_result_var_df("manufacturer")
perm_ndelay_manufacturer_df <- perm_manufacturer_df[[1]]


#permutation - manu_model
perm_manu_model_df <- perm_result_var_df("manu_model")
perm_ndelay_manu_model_df <- perm_manu_model_df[[1]]


#permutation - dest
perm_dest_df <- perm_result_var_df("dest")
perm_ndelay_dest_df <- perm_dest_df[[1]]


#permutation - seats
perm_seats_df <- perm_result_var_df("seats")
perm_ndelay_seats_df <- perm_seats_df[[1]]

#change model levels in flights_full_arranged_15
flights_full_arranged_15$model <-
  as.character(flights_full_arranged_15$model)
models_to_change <-
  perm_ndelay_model_df$model[which(perm_ndelay_model_df$p_adj <= 0.05)] #filtered levels (were kept)

flights_full_arranged_15 <-
  flights_full_arranged_15 %>% mutate(model = replace(
    model,
    !(flights_full_arranged_15$model %in% models_to_change),
    'other_model'
  ))

flights_full_arranged_15$model <-
  as.factor(flights_full_arranged_15$model) #convert back to factor variable


#change model levels into  one-letter categories
flights_full_arranged_15_bf_change <- flights_full_arranged_15
model_levels_after_change <-
  data.frame(model = levels(flights_full_arranged_15$model))
model_levels_after_change$new_name <-
  letters[1:nrow(model_levels_after_change)]
model_levels_after_change$new_name[model_levels_after_change$model == "other_model"] <-
  "other"
model_letter_cat <-
  lapply(flights_full_arranged_15$model, function(row) {
    y <-
      model_levels_after_change$new_name[which(row == model_levels_after_change$model)]
    y
  })
flights_full_arranged_15$model <- as.factor(unlist(model_letter_cat))



#change manufacturer levels in flights_full_arranged_15
flights_full_arranged_15$manufacturer <-
  as.character(flights_full_arranged_15$manufacturer)
manufacturers_to_change <-
  perm_ndelay_manufacturer_df$manufacturer[which(perm_ndelay_manufacturer_df$p_adj <=
                                                   0.05)] #filtered levels (were kept)

flights_full_arranged_15 <-
  flights_full_arranged_15 %>% mutate(manufacturer = replace(
    manufacturer,
    !(
      flights_full_arranged_15$manufacturer %in% manufacturers_to_change
    ),
    'other_manufacturer'
  ))

flights_full_arranged_15$manufacturer <-
  as.factor(flights_full_arranged_15$manufacturer) #convert back to factor variable


#change manu_model levels in flights_full_arranged_15
flights_full_arranged_15$manu_model <-
  as.character(flights_full_arranged_15$manu_model)
manu_models_to_change <-
  perm_ndelay_manu_model_df$manu_model[which(perm_ndelay_manu_model_df$p_adj <=
                                               0.05)] #filtered levels (were kept)

flights_full_arranged_15 <-
  
  flights_full_arranged_15 %>% mutate(manu_model = replace(
    manu_model,
    !(
      flights_full_arranged_15$manu_model %in% manu_models_to_change
    ),
    'other_manu_model'
  ))

flights_full_arranged_15$manu_model <-
  as.factor(flights_full_arranged_15$manu_model) #convert back to factor variable


#change manu_model column into  one-letter categories
flights_full_arranged_15_bf_change <- flights_full_arranged_15
manu_model_levels_after_change <-
  data.frame(manu_model = levels(flights_full_arranged_15$manu_model))
manu_model_levels_after_change$new_name <-
  letters[1:nrow(manu_model_levels_after_change)]
manu_model_levels_after_change$new_name[manu_model_levels_after_change$manu_model ==
                                          "other_manu_model"] <- "other"
manu_model_letter_cat <-
  lapply(flights_full_arranged_15$manu_model, function(row) {
    y <-
      manu_model_levels_after_change$new_name[which(row == manu_model_levels_after_change$manu_model)]
    y
  })
flights_full_arranged_15$manu_model <-
  as.factor(unlist(manu_model_letter_cat))


#change destination levels in flights_full_arranged_15
flights_full_arranged_15$dest <-
  as.character(flights_full_arranged_15$dest)
dests_to_change <-
  perm_ndelay_dest_df$dest[which(perm_ndelay_dest_df$p_adj <= 0.05)] #filtered levels (were kept)

flights_full_arranged_15 <-
  flights_full_arranged_15 %>% mutate(dest = replace(
    dest,
    !(flights_full_arranged_15$dest %in% dests_to_change),
    'other_dest'
  ))

flights_full_arranged_15$dest <-
  as.factor(flights_full_arranged_15$dest) #convert back to factor variable



#change destination column into number categories
dest_levels_after_change <-
  data.frame(dest = levels(flights_full_arranged_15$dest))
dest_levels_after_change$new_name <- 1:nrow(dest_levels_after_change)
dest_number_cat <-
  lapply(flights_full_arranged_15$dest, function(row) {
    y <- which(row == dest_levels_after_change$dest)
    y
  })
flights_full_arranged_15$dest <- as.factor(unlist(dest_number_cat))


#change seats levels in flights_full_arranged_15
flights_full_arranged_15$seats <-
  as.character(flights_full_arranged_15$seats)
seats_to_change <-
  perm_ndelay_seats_df$seats[which(perm_ndelay_seats_df$p_adj <= 0.05)] #filtered levels (were kept)

flights_full_arranged_15 <-
  flights_full_arranged_15 %>% mutate(seats = replace(
    seats,
    !(flights_full_arranged_15$seats %in% seats_to_change),
    'other_seats'
  ))

flights_full_arranged_15$seats <-
  as.factor(flights_full_arranged_15$seats) #convert back to factor variable

Threshold = 25 minutes

flights_full_25 <- flights_full

#divide the flights into 2 groups according to their dep_delay - flights above 25 min delay, and flights above -10 & until 25 min delay

flights_full_new_dep_delay_25 <-
  flights_full_25[which(flights_full_25$dep_delay > -10), ]
flights_full_arranged_25 <-
  flights_full_new_dep_delay_25 %>% arrange(dep_delay)


# plot histogram of original dep_delay before changing it into 2 categories with the threshold
ggplot(flights_full_arranged_25, aes(x = dep_delay)) +
  geom_histogram(color = "black", fill = "white", bins = 40) +
  geom_vline(aes(xintercept = 25, color = "delay > 25 min"),
             linetype = "dashed",
             size = 1.3) +
  
  scale_color_manual(name = "Treshold delay time", values = c("delay > 25 min" = "red")) +
  
  labs(
    x = "Departure delay time [min]",
    y = "counts of flights",
    title = paste('Histogram of departure delay time')
  ) +
  theme(plot.title = element_text(hjust = 0.5, size = 19, face = "bold"))

# percentage of dep_delay=0
(length(which(
  flights_full_arranged_25$dep_delay < (25)
))) / nrow(flights_full_arranged_25)
## [1] 0.8218672
# percentage of dep_delay=1
(length(which(
  flights_full_arranged_25$dep_delay >= (25)
))) / nrow(flights_full_arranged_25)
## [1] 0.1781328
#change dep_delay column into categories (0 / 1)
flights_full_arranged_25 <-
  
  flights_full_arranged_25 %>% mutate(dep_delay = case_when(dep_delay < 25 ~ 0,
                                                            dep_delay >= 25 ~ 1))


#convert dep_delay to factor column
flights_full_arranged_25$dep_delay <-
  as.factor(flights_full_arranged_25$dep_delay)


#plot flights counts per 2 dep_delay categories
ggplot(flights_full_arranged_25, aes(dep_delay, fill = dep_delay)) + geom_bar(stat="count") + 
  scale_fill_manual(labels=c("0 = no delay", "1 = delay"), values = c('#CC6666', '#FFCCCC')) +
  labs(title = "Flights counts per departure delay category (25)", x = "Departure delay categories") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

# merge manufacturer and model columns into one column
manu_model <-
  paste(flights_full_arranged_25$manufacturer,
        flights_full_arranged_25$model,
        sep = "_")
flights_full_arranged_25$manu_model <- manu_model


# permutation function - skip it because it is already done
levels_model <-
  levels(factor(flights_full_arranged_25$model)) #levels of model
levels_manufacturer <-
  levels(factor(flights_full_arranged_25$manufacturer)) #levels of manufacturer
levels_manu_model <-
  levels(factor(flights_full_arranged_25$manu_model)) #levels of manu_model
levels_dest <-
  levels(factor(flights_full_arranged_25$dest)) #levels of destinations
levels_seats <-
  levels(factor(flights_full_arranged_25$seats)) #levels of seats

run_permutations = FALSE

if (run_permutations) {
  origin_num_delay_var <- function(var_name, levels_var) {
    origin_delay_vec <-
      sapply(
        levels_var,
        simplify = TRUE,
        FUN = function(one_level) {
          df_summarize <-
            flights_full_arranged_25 %>% filter((!!sym(var_name)) == one_level) %>% group_by(dep_delay, .drop =
                                                                                               FALSE) %>% tally
          number_delay_in_level <-
            df_summarize$n[df_summarize$dep_delay == 1]
          number_delay_in_level
        }
      )
    origin_delay_vec
  }
  
  origin_num_delay_model <-
    origin_num_delay_var('model', levels_model)
  origin_num_delay_manufacturer <-
    origin_num_delay_var('manufacturer', levels_manufacturer)
  origin_num_delay_manu_model <-
    origin_num_delay_var('manu_model', levels_manu_model)
  origin_num_delay_dest <- origin_num_delay_var('dest', levels_dest)
  origin_num_delay_seats <-
    origin_num_delay_var('seats', levels_seats)
  
  
  perm_vec_var <- function(levels_var_model,
                           origin_num_delay_var) {
    perm_vec <- rep(1, length(levels_var_model))
    perm_vec <- setNames(perm_vec, names(origin_num_delay_var))
  }
  
  perm_ndelay_vec_model <-
    perm_vec_var(levels_model, origin_num_delay_model)
  perm_ndelay_vec_manufacturer <-
    perm_vec_var(levels_manufacturer, origin_num_delay_manufacturer)
  perm_ndelay_vec_manu_model <-
    perm_vec_var(levels_manu_model, origin_num_delay_manu_model)
  perm_ndelay_vec_dest <-
    perm_vec_var(levels_dest, origin_num_delay_dest)
  perm_ndelay_vec_seats <-
    perm_vec_var(levels_seats, origin_num_delay_seats)
  
  create_perm_ndelay_vec_var <-
    function(flights_full_perm,
             var_name,
             levels_var,
             origin_num_delay_var,
             perm_vec_var) {
      perm_num_delay_var <-
        sapply(
          levels_var,
          simplify = TRUE,
          FUN = function(one_level) {
            df_summarize <-
              flights_full_perm %>% filter((!!sym(var_name)) == one_level) %>% group_by(dep_delay, .drop =
                                                                                          FALSE) %>% tally
            number_delay_in_level <-
              df_summarize$n[df_summarize$dep_delay == 1]
            number_delay_in_level
          }
        )
      ind_greater <- which(perm_num_delay_var >= origin_num_delay_var)
      perm_vec_var[ind_greater] <- perm_vec_var[ind_greater] + 1
      perm_vec_var
    }
  
  num_perm <- 1000
  for (iter in 1:num_perm) {
    print(iter)
    flights_full_perm <-
      transform(flights_full_arranged_25, dep_delay = sample(dep_delay))
    perm_ndelay_vec_model <-
      create_perm_ndelay_vec_var(
        flights_full_perm,
        'model',
        levels_model,
        origin_num_delay_model,
        perm_ndelay_vec_model
      )
    perm_ndelay_vec_manufacturer <-
      create_perm_ndelay_vec_var(
        flights_full_perm,
        'manufacturer',
        levels_manufacturer,
        origin_num_delay_manufacturer,
        perm_ndelay_vec_manufacturer
      )
    perm_ndelay_vec_manu_model <-
      create_perm_ndelay_vec_var(
        flights_full_perm,
        'manu_model',
        levels_manu_model,
        origin_num_delay_manu_model,
        perm_ndelay_vec_manu_model
      )
    perm_ndelay_vec_dest <-
      create_perm_ndelay_vec_var(
        flights_full_perm,
        'dest',
        levels_dest,
        origin_num_delay_dest,
        perm_ndelay_vec_dest
      )
    perm_ndelay_vec_seats <-
      create_perm_ndelay_vec_var(
        flights_full_perm,
        'seats',
        levels_seats,
        origin_num_delay_seats,
        perm_ndelay_vec_seats
      )
  }
  
  #save perm_ndelay_vec_var output from permutations to csv (as tibble data frmae)
  perm_ndelay_vec_model_df = tibble(name = names(perm_ndelay_vec_model), value = perm_ndelay_vec_model)
  write.table(
    perm_ndelay_vec_model_df ,
    file = "../output/perm_ndelay_merged_vec_model_df_25.csv",
    sep = ",",
    row.names = FALSE
  )
  perm_ndelay_vec_manufacturer_df = tibble(name = names(perm_ndelay_vec_manufacturer),
                                           value = perm_ndelay_vec_manufacturer)
  write.table(
    perm_ndelay_vec_manufacturer_df ,
    file = "../output/perm_ndelay_merged_vec_manufacturer_df_25.csv",
    sep = ",",
    row.names = FALSE
  )
  perm_ndelay_vec_manu_model_df = tibble(name = names(perm_ndelay_vec_manu_model),
                                         value = perm_ndelay_vec_manu_model)
  write.table(
    perm_ndelay_vec_manu_model_df ,
    file = "../output/perm_ndelay_merged_vec_manu_model_df_25.csv",
    sep = ",",
    row.names = FALSE
  )
  perm_ndelay_vec_dest_df = tibble(name = names(perm_ndelay_vec_dest), value = perm_ndelay_vec_dest)
  write.table(
    perm_ndelay_vec_dest_df ,
    file = "../output/perm_ndelay_merged_vec_dest_df_25.csv",
    sep = ",",
    row.names = FALSE
  )
  perm_ndelay_vec_seats_df = tibble(name = names(perm_ndelay_vec_seats), value = perm_ndelay_vec_seats)
  write.table(
    perm_ndelay_vec_seats_df ,
    file = "../output/perm_ndelay_merged_vec_seats_df_25.csv",
    sep = ",",
    row.names = FALSE
  )
}



num_perm <- 1000
perm_result_var_df <- function(var_name) {
  perm_ndelay_var_df <-
    read.table(
      file = paste(
        "../output/perm_ndelay_merged_vec_",
        var_name,
        "_df_25.csv",
        sep = ""
      ),
      sep = ",",
      header = TRUE
    )
  colnames(perm_ndelay_var_df)[1] <- var_name
  perm_ndelay_var_df$p_val <- perm_ndelay_var_df$value / num_perm
  perm_ndelay_var_df$p_adj <-
    p.adjust(perm_ndelay_var_df$p_val, method = "BH")
  
  flights_counts_var_df <-
    flights_full_arranged_25 %>% group_by((!!sym(var_name))) %>% summarise(total_counts =
                                                                             n())
  perm_ndelay_var_df <-
    merge(perm_ndelay_var_df, flights_counts_var_df, by = var_name) #add flights counts per level
  
  perm_ndelay_var_df <- perm_ndelay_var_df %>% arrange(p_adj)
  
  p_val_plot <-
    ggplot(perm_ndelay_var_df, aes(x = reorder((!!sym(
      var_name
    )), p_adj) , y = p_adj)) + geom_bar(stat = "identity") +
    labs(
      title = paste("p-value adjusted per", var_name, "level"),
      x = var_name,
      y = "p-value adjusted"
    ) +
    theme(
      plot.title = element_text(
        hjust = 0.5,
        size = 19,
        face = "bold"
      ),
      axis.text.x = element_text(
        angle = 90,
        vjust = 0.5,
        hjust = 1,
        size = 6
      )
    ) +
    geom_hline(aes(yintercept = 0.05, color = "0.05"),
               linetype = "dashed",
               size = 1.5) +
    scale_color_manual(name = "FDR cutoff", values = c("0.05" = "red"))
  
  flights_counts_plot <-
    ggplot(perm_ndelay_var_df,  aes(x = reorder((!!sym(
      var_name
    )), p_adj) , y = total_counts)) +
    geom_bar(stat = "identity") + labs(
      title = paste("flights counts per", var_name, "level"),
      x = var_name,
      y = "flights counts"
    ) +
    theme(
      plot.title = element_text(
        hjust = 0.5,
        size = 19,
        face = "bold"
      ),
      axis.text.x = element_text(
        angle = 90,
        vjust = 0.5,
        hjust = 1,
        size = 6
      )
    )
  
  return(list(perm_ndelay_var_df, p_val_plot, flights_counts_plot))
}

#permutation - model
perm_model_df <- perm_result_var_df("model")
perm_ndelay_model_df <- perm_model_df[[1]]


#permutation - manufacturer
perm_manufacturer_df <- perm_result_var_df("manufacturer")
perm_ndelay_manufacturer_df <- perm_manufacturer_df[[1]]

#permutation - manu_model
perm_manu_model_df <- perm_result_var_df("manu_model")
perm_ndelay_manu_model_df <- perm_manu_model_df[[1]]


#permutation - dest
perm_dest_df <- perm_result_var_df("dest")
perm_ndelay_dest_df <- perm_dest_df[[1]]


#permutation - seats
perm_seats_df <- perm_result_var_df("seats")
perm_ndelay_seats_df <- perm_seats_df[[1]]


#change model levels in flights_full_arranged_25
flights_full_arranged_25$model <-
  as.character(flights_full_arranged_25$model)
models_to_change <-
  perm_ndelay_model_df$model[which(perm_ndelay_model_df$p_adj <= 0.05)] #filtered levels (were kept)

flights_full_arranged_25 <-
  flights_full_arranged_25 %>% mutate(model = replace(
    model,
    !(flights_full_arranged_25$model %in% models_to_change),
    'other_model'
  ))

flights_full_arranged_25$model <-
  as.factor(flights_full_arranged_25$model) #convert back to factor variable


#change model levels into  one-letter categories
flights_full_arranged_25_bf_change <- flights_full_arranged_25
model_levels_after_change <-
  data.frame(model = levels(flights_full_arranged_25$model))
model_levels_after_change$new_name <-
  letters[1:nrow(model_levels_after_change)]
model_levels_after_change$new_name[model_levels_after_change$model == "other_model"] <-
  "other"
model_letter_cat <-
  lapply(flights_full_arranged_25$model, function(row) {
    y <-
      model_levels_after_change$new_name[which(row == model_levels_after_change$model)]
    y
  })
flights_full_arranged_25$model <- as.factor(unlist(model_letter_cat))



#change manufacturer levels in flights_full_arranged_25
flights_full_arranged_25$manufacturer <-
  as.character(flights_full_arranged_25$manufacturer)
manufacturers_to_change <-
  perm_ndelay_manufacturer_df$manufacturer[which(perm_ndelay_manufacturer_df$p_adj <=
                                                   0.05)] #filtered levels (were kept)

flights_full_arranged_25 <-
  flights_full_arranged_25 %>% mutate(manufacturer = replace(
    manufacturer,
    !(
      flights_full_arranged_25$manufacturer %in% manufacturers_to_change
    ),
    'other_manufacturer'
  ))

flights_full_arranged_25$manufacturer <-
  as.factor(flights_full_arranged_25$manufacturer) #convert back to factor variable


#change manu_model levels in flights_full_arranged_25
flights_full_arranged_25$manu_model <-
  as.character(flights_full_arranged_25$manu_model)
manu_models_to_change <-
  perm_ndelay_manu_model_df$manu_model[which(perm_ndelay_manu_model_df$p_adj <=
                                               0.05)] #filtered levels (were kept)

flights_full_arranged_25 <-
  
  flights_full_arranged_25 %>% mutate(manu_model = replace(
    manu_model,
    !(
      flights_full_arranged_25$manu_model %in% manu_models_to_change
    ),
    'other_manu_model'
  ))

flights_full_arranged_25$manu_model <-
  as.factor(flights_full_arranged_25$manu_model) #convert back to factor variable


#change manu_model column into  one-letter categories
flights_full_arranged_25_bf_change <- flights_full_arranged_25
manu_model_levels_after_change <-
  data.frame(manu_model = levels(flights_full_arranged_25$manu_model))
manu_model_levels_after_change$new_name <-
  letters[1:nrow(manu_model_levels_after_change)]
manu_model_levels_after_change$new_name[manu_model_levels_after_change$manu_model ==
                                          "other_manu_model"] <- "other"
manu_model_letter_cat <-
  lapply(flights_full_arranged_25$manu_model, function(row) {
    y <-
      manu_model_levels_after_change$new_name[which(row == manu_model_levels_after_change$manu_model)]
    y
  })
flights_full_arranged_25$manu_model <-
  as.factor(unlist(manu_model_letter_cat))


#change destination levels in flights_full_arranged_25
flights_full_arranged_25$dest <-
  as.character(flights_full_arranged_25$dest)
dests_to_change <-
  perm_ndelay_dest_df$dest[which(perm_ndelay_dest_df$p_adj <= 0.05)] #filtered levels (were kept)

flights_full_arranged_25 <-
  flights_full_arranged_25 %>% mutate(dest = replace(
    dest,
    !(flights_full_arranged_25$dest %in% dests_to_change),
    'other_dest'
  ))

flights_full_arranged_25$dest <-
  as.factor(flights_full_arranged_25$dest) #convert back to factor variable



#change destination column into number categories
dest_levels_after_change <-
  data.frame(dest = levels(flights_full_arranged_25$dest))
dest_levels_after_change$new_name <- 1:nrow(dest_levels_after_change)
dest_number_cat <-
  lapply(flights_full_arranged_25$dest, function(row) {
    y <- which(row == dest_levels_after_change$dest)
    y
  })
flights_full_arranged_25$dest <- as.factor(unlist(dest_number_cat))


#change seats levels in flights_full_arranged_25
flights_full_arranged_25$seats <-
  as.character(flights_full_arranged_25$seats)
seats_to_change <-
  perm_ndelay_seats_df$seats[which(perm_ndelay_seats_df$p_adj <= 0.05)] #filtered levels (were kept)

flights_full_arranged_25 <-
  flights_full_arranged_25 %>% mutate(seats = replace(
    seats,
    !(flights_full_arranged_25$seats %in% seats_to_change),
    'other_seats'
  ))

flights_full_arranged_25$seats <-
  as.factor(flights_full_arranged_25$seats) #convert back to factor variable

Inference

Removing Irrelevant Predicting Variables

We decided to exclude ‘model’ and ‘manufacturer’ variables from our model analysis and instead we took into account the ‘manu_model’ variable. As we discussed above we combined models and manufacturers into one column representing unique models according to their manufacturer. This step prevented future mistakenly grouping of planes with identical model name but different manufacturer. As we prefer to reduce the number of predicting variables in our model to prevent confusion we removed ‘model’ and ‘manufacturer’, and let the model be based on ‘mau_model’ variable.

## [1] 24
##    dest origin month.x sched_dep_time dep_delay sched_arr_time carrier distance
## 1:    1    JFK       5           1201         0           1388      B6     1826
## 2:    1    JFK       9           1201         0           1368      B6     1826
## 3:    1    JFK      10           1199         0           1366      B6     1826
## 4:    1    JFK      10           1199         0           1366      B6     1826
## 5:    1    JFK      11           1200         0           1383      B6     1826
## 6:   36    JFK       7            729         0            796      B6      199
##    year.y                    type engines seats    engine  temp humid wind_dir
## 1:   2004 Fixed wing multi engine       2 other Turbo-fan 57.92 69.52        S
## 2:   2004 Fixed wing multi engine       2 other Turbo-fan 62.96 70.07      ENE
## 3:   2006 Fixed wing multi engine       2 other Turbo-fan 60.08 74.56        S
## 4:   2004 Fixed wing multi engine       2 other Turbo-fan 53.06 48.51       SW
## 5:   2011 Fixed wing multi engine       2 other Turbo-fan 51.98 36.31        W
## 6:   2005 Fixed wing multi engine       2 other Turbo-fan 82.94 69.53       SW
##    wind_speed precip pressure visib            tzone w_day week_num manu_model
## 1:    6.90468      0   1021.0    10   America/Denver   Mon       20      other
## 2:    3.45234      0   1024.8    10   America/Denver   Sat       38      other
## 3:   11.50780      0   1009.3    10   America/Denver   Sat       41      other
## 4:   20.71404      0   1014.5    10   America/Denver   Sat       42      other
## 5:   18.41248      0   1016.7    10   America/Denver   Sun       44      other
## 6:   12.65858      0   1003.7    10 America/New_York   Tue       29      other
ncol(flights_full_arranged_15)
## [1] 24
ncol(flights_full_arranged_25)
## [1] 24

Train-Test Split

Split the balanced data to train and test (80/20)

set.seed(45)
train_test_split <- function(flights_data) {
  sample = sample.split(flights_data$dep_delay, SplitRatio = .8)
  train = subset(flights_data, sample == TRUE)
  test  = subset(flights_data, sample == FALSE)
  train_test <- list(train, test)
  return(train_test)
}
train_test_20 <- train_test_split(flights_full_arranged)
train<- train_test_20[[1]]
test <- train_test_20[[2]]
dim(train)
## [1] 210090     24
dim(test)
## [1] 52523    24

There is a need to balance the train dataset because the amount of delays is ~20%. Imbalanced data can cause the model to be very good at learning the class that is the majority, but it will not learn to predict the minority class at the same level. There are mainly two ways to handle an imbalanced dataset. 1. Random oversampling- duplicates examples from the minority class in the training dataset and can result in overfitting for some models. 2. Random undersampling- delete examples from the majority class and can result in losing information invaluable to a model. We are using ovun.sample function from the ROSE package, which creates possibly balanced samples by a combination of random over-sampling minority examples and under-sampling majority examples.

get_balanced_df <- function(train, method = "both") {
  balanced.data <-
    ovun.sample(
      dep_delay ~ .,
      train,
      method = method,
      p = 0.5,
      seed = 1996
    )$data
  return(balanced.data)
}
#balanced train data
 train <- get_balanced_df(train)
 table(train$dep_delay)
## 
##     0     1 
## 85915 86651

After balancing the train data:

Sensitivity check - splitting into train-test data sets and balancing the train data:

#threshold = 15
train_test_15 <- train_test_split(flights_full_arranged_15) #divide to train and test

train_15_splitted<- train_test_15[[1]] #train data
train_15 <- get_balanced_df(train_15_splitted)#balance the data

test_15_splitted<- train_test_15[[2]] #test data
test_15 <- get_balanced_df(test_15_splitted) #balance the data


#threshold = 25
train_test_25 <- train_test_split(flights_full_arranged_25) #divide to train and test

train_25_splitted<- train_test_25[[1]] #train data
train_25 <- get_balanced_df(train_25_splitted)#balance the data

test_25_splitted<- train_test_25[[2]] #test data
test_25 <- get_balanced_df(test_25_splitted) #balance the data

Classification Decision Tree

Before running predictive model, we would like to choose best parameters for optimizing the model. Thus, we used the machine learning ‘mlr’ R package that supplies tools to train several different models and to perform tuning for hyperparameters. Using this tools we can view each hyperparameter impact on the performance of the model.In ‘rpart’ decision tree library, we control the parameters using the rpart.control() function. We chose the following hyperparameters to include in our model: minsplit - the minimum number of observations that must exist in a node in order for a split to be attempted. maxdepth - the maximum depth of any node of the final tree. *complexity parameter (cp) - the amount of improved relative error when splitting the node. For example, splitting the original root node decreases the relative error from 1.0 to 0.5, so the CP of the root node is 0.5. (Elsayed, 2019)

We created a grid of the hyperparameters to iterate on. For each iteration using cross validation method we measured the accuracy to evaluate the results of each parameters combination .

train_df<-as.data.frame(train)
test_df<-as.data.frame(test)

model.mlr <- makeClassifTask(
  data=train_df, 
  target="dep_delay"
)

# Search Parameter for Max Depth
param_grid <- makeParamSet( 
  makeDiscreteParam("maxdepth", values=seq(19,25, by=3)),
  makeDiscreteParam("cp", values=c(0.00005, 0.0001,0.0002,0.0003, 0.0004)),
  makeDiscreteParam("minsplit", values=seq(19,25, by=3))
)

# Define Grid
control_grid = makeTuneControlGrid()

# Define Cross Validation
resample = makeResampleDesc("CV", iters = 3L)

# Define Measure
measure = acc

#set.seed(123) 
dt_tuneparam <- tuneParams(learner='classif.rpart', 
                           task=model.mlr, 
                           resampling = resample,
                           measures = measure,
                           par.set=param_grid, 
                           control=control_grid, 
                           show.info = TRUE)

dt_tuneparam
## Tune result:
## Op. pars: maxdepth=25; cp=5e-05; minsplit=19
## acc.test.mean=0.7512314
result_hyperparam <- generateHyperParsEffectData(dt_tuneparam, partial.dep = TRUE)

Plot of mean accuracy test as a function of cp parameter during tuning parameters

ggplot(
  data = result_hyperparam$data,
  aes(x = cp, y=acc.test.mean)
) + geom_line(color = 'darkblue') +labs(title="Test accuracy mean per complexity parameter")+
theme(plot.title = element_text(hjust = 0.5, face = "bold"))

best_parameters_multi = setHyperPars(
  makeLearner("classif.rpart", predict.type = "prob"), 
  par.vals = dt_tuneparam$x
)

result_hyperparam.multi <- generateHyperParsEffectData(dt_tuneparam, partial.dep = TRUE)

Plot of accuracy per parameters’ values according to tuning parameters process

full_sample_param<-result_hyperparam.multi$data
result_sample <- result_hyperparam.multi$data %>%
  sample_n(nrow(full_sample_param))
hyperparam.plot <- plot_ly(result_sample, 
                           x = ~cp, 
                           y = ~maxdepth, 
                           z = ~minsplit,
                           marker = list(color = ~acc.test.mean,  colorscale = list(c(0, 1), c("darkred", "darkgreen")), showscale = TRUE))
hyperparam.plot <- hyperparam.plot %>% add_markers() %>% layout(title="Tuning hyperparameters results")
hyperparam.plot

parameters tuning results showed that cp parameter is most affecting the accuracy results on validation set - specifically low cp leads to better mean accuracy of the validation data set. As a result we conducted addition check to evaluate train and test accuracy with critical values of ‘cp’ parameter.Thus we ran our classification model once with cp=0.00005 and once with cp=0.0001 in order to decide which value yields better and similar train and test accuracy. Large difference between train and test accuracies may indicates on overfitting - a situation when model learned too much the training dataset such that it doesn’t predict good observations that were previously unseen.

In order to evaluate the model we would like to estimate the accuracy - a metric used in classification problems to tell the percentage of accurate predictions. Accuracy is calculated by dividing the number of correct predictions (true positive and true negative) by the total number of samples obtained from confusion matrix.

Train and test accuracy check for cp=0.00005:

#check accuracy train and test for cp=0.00005
overfit_class = rpart(dep_delay~ .,   data=train, method = 'class', control = c(cp=0.00005), maxdepth = dt_tuneparam$x$maxdepth, minsplit = dt_tuneparam$x$minsplit)

#predict train
predicted_class_train_overfit<- predict(overfit_class, train, type = 'class')
accuracy(train$dep_delay, predicted_class_train_overfit)
## [1] 0.825122
#predict test
predicted_class_test_overfit<- predict(overfit_class, test, type = 'class')
accuracy(test$dep_delay, predicted_class_test_overfit)
## [1] 0.6831293

The results obtained for cp=0.00005 may indicates on overfitting since there is difference of almost 0.1 in accuracy value (difference approximately 15% of the test accuracy value). Thus we decided to choose higher ‘cp’ value (cp=0.0001) which preserves an accuracy test in approximate to accuracy test obtained in cp=0.00005.

Train and test accuracy check for cp=0.0001:

#check accuracy train and test for cp=0.0001
best_class = rpart(dep_delay~ .,   data=train, method = 'class', control = c(cp=0.0001), maxdepth = dt_tuneparam$x$maxdepth, minsplit = dt_tuneparam$x$minsplit)

#predict train
predicted_class_train_best<- predict(best_class, train, type = 'class')
accuracy_train_class_best_20<-accuracy(train$dep_delay, predicted_class_train_best)
accuracy_train_class_best_20
## [1] 0.7562324
#predict test
predicted_class_test_best<- predict(best_class, test, type = 'class')
accuracy_test_class_best_20<-accuracy(test$dep_delay, predicted_class_test_best)
accuracy_test_class_best_20
## [1] 0.6873751

Choosing cp=0.0001 shows similar results in train and test accuracy (difference around 6% of test accuracy value=0.7). Those similar results reflecting there is no overfitting in our model and the current parameters values comtributing to better classification of unseen flights.

Since a classification model with cp=0.0001 visually looked excessively branched, we plotted a tree with cp=0.0004 which is much less brunched and thus can serves as a visual demonstration (but not accurate) for our model and enables easily interpretation of the decision process in our classification model.

Visualiztion of the tree with cp=0.0004:

class_visual = rpart(dep_delay~ .,   data=train, method = 'class', control = c(cp=0.0004),  maxdepth = dt_tuneparam$x$maxdepth, minsplit = dt_tuneparam$x$minsplit)


split.fun <- function(x, labs, digits, varlen, faclen)
{
  # replace commas with spaces (needed for strwrap)
  labs <- gsub(",", " ", labs)
  for(i in 1:length(labs)) {
    # split labs[i] into multiple lines
    labs[i] <- paste(strwrap(labs[i], width = 15), collapse = "\n")
  }
  labs
}

rpart.plot(class_visual, split.fun = split.fun)

Another way to summarize model results is by calculating a confusion matrix which shows the number of correct negative predictions, the number of incorrect positive predictions, the number of incorrect negative prediction, and the number of correct positive predictions (Visa, Ramsay, Ralescu, & Van Der Knaap, 2011).

confusion_class_train <- confusionMatrix(predicted_class_train_best, train$dep_delay)
confusion_class_train
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 65203 21354
##          1 20712 65297
##                                           
##                Accuracy : 0.7562          
##                  95% CI : (0.7542, 0.7583)
##     No Information Rate : 0.5021          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5125          
##                                           
##  Mcnemar's Test P-Value : 0.001776        
##                                           
##             Sensitivity : 0.7589          
##             Specificity : 0.7536          
##          Pos Pred Value : 0.7533          
##          Neg Pred Value : 0.7592          
##              Prevalence : 0.4979          
##          Detection Rate : 0.3778          
##    Detection Prevalence : 0.5016          
##       Balanced Accuracy : 0.7562          
##                                           
##        'Positive' Class : 0               
## 
confusion_class_test <- confusionMatrix(predicted_class_test_best, test$dep_delay)
confusion_class_test
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 29291  3840
##          1 12580  6812
##                                           
##                Accuracy : 0.6874          
##                  95% CI : (0.6834, 0.6913)
##     No Information Rate : 0.7972          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.2596          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.6996          
##             Specificity : 0.6395          
##          Pos Pred Value : 0.8841          
##          Neg Pred Value : 0.3513          
##              Prevalence : 0.7972          
##          Detection Rate : 0.5577          
##    Detection Prevalence : 0.6308          
##       Balanced Accuracy : 0.6695          
##                                           
##        'Positive' Class : 0               
## 

Comparing classification decision tree (with threshold=20) results to sensitivity checks (tresholds 15 and 25) results

As mentioned earlier, we chose 20 minutes as a threshold to determine if a flight is considered delayed. We generated two additional datasets: one with a higher threshold (25 minutes) and one with a lower threshold (15 minutes) in order to test the sensitivity of our predictions.

Performing our model on these datasets:

Threshold = 15

class_15 <- rpart(dep_delay~ ., data=train_15, method = 'class', control = c(cp=0.0001), maxdepth = dt_tuneparam$x$maxdepth, minsplit = dt_tuneparam$x$minsplit) #run classification decision tree

pred_class_15 <- predict(class_15, test_15, type = 'class') #prediction
confusion_class_15 <- confusionMatrix(pred_class_15, test_15$dep_delay)
confusion_class_15
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     1     0
##          1 66860 17686
##          0 19055 68965
##                                          
##                Accuracy : 0.7871         
##                  95% CI : (0.7852, 0.789)
##     No Information Rate : 0.5021         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.5741         
##                                          
##  Mcnemar's Test P-Value : 9.545e-13      
##                                          
##             Sensitivity : 0.7782         
##             Specificity : 0.7959         
##          Pos Pred Value : 0.7908         
##          Neg Pred Value : 0.7835         
##              Prevalence : 0.4979         
##          Detection Rate : 0.3874         
##    Detection Prevalence : 0.4899         
##       Balanced Accuracy : 0.7871         
##                                          
##        'Positive' Class : 1              
## 
accuracy_num_class_15<-accuracy(test_15$dep_delay, pred_class_15)

Threshold = 25

class_25 <- rpart(dep_delay~ ., data=train_25, method = 'class', control = c(cp=0.0001), maxdepth = dt_tuneparam$x$maxdepth, minsplit = dt_tuneparam$x$minsplit) #run classification decision tree

pred_class_25 <- predict(class_25, test_25, type = 'class') #prediction
confusion_class_25 <- confusionMatrix(pred_class_25, test_25$dep_delay)
confusion_class_25
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     1     0
##          1 66860 17686
##          0 19055 68965
##                                          
##                Accuracy : 0.7871         
##                  95% CI : (0.7852, 0.789)
##     No Information Rate : 0.5021         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.5741         
##                                          
##  Mcnemar's Test P-Value : 9.545e-13      
##                                          
##             Sensitivity : 0.7782         
##             Specificity : 0.7959         
##          Pos Pred Value : 0.7908         
##          Neg Pred Value : 0.7835         
##              Prevalence : 0.4979         
##          Detection Rate : 0.3874         
##    Detection Prevalence : 0.4899         
##       Balanced Accuracy : 0.7871         
##                                          
##        'Positive' Class : 1              
## 
accuracy_num_class_25<-accuracy(test_25$dep_delay, pred_class_25)

A visualization of all test confusion matrices (of datasets based on threshold=20, threshold=15 and threshold=25 defining delayed flights):

Plot each predictor’s importance in simple classification tree model when the threshold=20:

best_class$variable.importance %>%
  data.frame() %>%
  rownames_to_column(var = "Feature") %>%
  rename(Overall = '.') %>%
  ggplot(aes(x = fct_reorder(Feature, Overall), y = Overall)) +
  geom_pointrange(aes(ymin = 0, ymax = Overall), color = "cadetblue", size = .3) +
  theme_minimal() +
  coord_flip() +
  labs(x = "", y = "", title = "Variable Importance with Simple Classication (20)")+
  theme(plot.title = element_text(hjust = 0.5, size = 19, face = "bold"))

Plot each predictor’s importance in simple classification tree model in case of threshold=15:

class_15$variable.importance %>%
  data.frame() %>%
  rownames_to_column(var = "Feature") %>%
  rename(Overall = '.') %>%
  ggplot(aes(x = fct_reorder(Feature, Overall), y = Overall)) +
  geom_pointrange(aes(ymin = 0, ymax = Overall), color = "cadetblue", size = .3) +
  theme_minimal() +
  coord_flip() +
  labs(x = "", y = "", title = "Variable Importance with Simple Classication (15)")+
  theme(plot.title = element_text(hjust = 0.5, size = 19, face = "bold"))

Plot each predictor’s importance in simple classification tree model in case of threshold=25:

class_25$variable.importance %>%
  data.frame() %>%
  rownames_to_column(var = "Feature") %>%
  rename(Overall = '.') %>%
  ggplot(aes(x = fct_reorder(Feature, Overall), y = Overall)) +
  geom_pointrange(aes(ymin = 0, ymax = Overall), color = "cadetblue", size = .3) +
  theme_minimal() +
  coord_flip() +
  labs(x = "", y = "", title = "Variable Importance with Simple Classication (25)")+
  theme(plot.title = element_text(hjust = 0.5, size = 19, face = "bold"))

Random Forest Model

The decision tree algorithm is relatively easy to understand and interpret. However, a single tree seems insufficient for producing effective results. We chose to run a Random Forest algorithm as it leverages the power of multiple decision trees. It does not rely on the feature importance given by a single decision tree but chooses features randomly during the training process. Therefore, the random forest can generalize the data in a better way. This randomized feature selection makes Random Forest much more accurate than a decision tree.

Run Random Forest on the train set

run_random <- function(train, ntrees=1000, depth = NULL) {
  fit <- ranger(
    dep_delay ~ .,
    data = train,
    num.trees = ntrees,
    importance = 'impurity',
    max.depth = depth,
    verbose = TRUE,
    seed = 26
    
  )
  return(fit)
}
rf <- run_random(train)
## Growing trees.. Progress: 30%. Estimated remaining time: 1 minute, 13 seconds.
## Growing trees.. Progress: 58%. Estimated remaining time: 45 seconds.
## Growing trees.. Progress: 85%. Estimated remaining time: 16 seconds.
rf
## Ranger result
## 
## Call:
##  ranger(dep_delay ~ ., data = train, num.trees = ntrees, importance = "impurity",      max.depth = depth, verbose = TRUE, seed = 26) 
## 
## Type:                             Classification 
## Number of trees:                  1000 
## Sample size:                      172566 
## Number of independent variables:  23 
## Mtry:                             4 
## Target node size:                 1 
## Variable importance mode:         impurity 
## Splitrule:                        gini 
## OOB prediction error:             5.95 %

We can see that the OOB prediction error is very low, which means the estimated accuracy is high!

Prediction of the test data set:

#train accuracy
confusion_rf_train <- table(train$dep_delay, rf$predictions)
confusion_rf_train
##    
##         0     1
##   0 78596  7319
##   1  2944 83707
confusionMatrix(confusion_rf_train)
## Confusion Matrix and Statistics
## 
##    
##         0     1
##   0 78596  7319
##   1  2944 83707
##                                           
##                Accuracy : 0.9405          
##                  95% CI : (0.9394, 0.9416)
##     No Information Rate : 0.5275          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.881           
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9639          
##             Specificity : 0.9196          
##          Pos Pred Value : 0.9148          
##          Neg Pred Value : 0.9660          
##              Prevalence : 0.4725          
##          Detection Rate : 0.4555          
##    Detection Prevalence : 0.4979          
##       Balanced Accuracy : 0.9417          
##                                           
##        'Positive' Class : 0               
## 
#test accuracy
test <- test %>% drop_na(year.y, temp, humid, wind_dir, wind_speed, pressure)
pred <- predict(rf, test)
confusion <- table(test$dep_delay, pred$predictions)
confusionMatrix(confusion)
## Confusion Matrix and Statistics
## 
##    
##         0     1
##   0 29525  5367
##   1  4104  4110
##                                           
##                Accuracy : 0.7803          
##                  95% CI : (0.7763, 0.7842)
##     No Information Rate : 0.7801          
##     P-Value [Acc > NIR] : 0.4749          
##                                           
##                   Kappa : 0.3273          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.8780          
##             Specificity : 0.4337          
##          Pos Pred Value : 0.8462          
##          Neg Pred Value : 0.5004          
##              Prevalence : 0.7801          
##          Detection Rate : 0.6849          
##    Detection Prevalence : 0.8094          
##       Balanced Accuracy : 0.6558          
##                                           
##        'Positive' Class : 0               
## 

As we can see we have high accuracy for predicting delay!

Running the Random Forest model followed with sensitivity check:

We applied the current Random Forest model on a data based on 20 minutes threshold as definition for delayed flight. Now we will on the sensitivity check dtasets we generated, with 25 and 15 minutes threshold, as we performed in simple classification decision tree.

Results of random forest model on these datasets:

Threshold = 15

rf_15 <- run_random(train_15) #run random forest
## Growing trees.. Progress: 29%. Estimated remaining time: 1 minute, 17 seconds.
## Growing trees.. Progress: 57%. Estimated remaining time: 46 seconds.
## Growing trees.. Progress: 86%. Estimated remaining time: 15 seconds.
rf_15
## Ranger result
## 
## Call:
##  ranger(dep_delay ~ ., data = train, num.trees = ntrees, importance = "impurity",      max.depth = depth, verbose = TRUE, seed = 26) 
## 
## Type:                             Classification 
## Number of trees:                  1000 
## Sample size:                      172566 
## Number of independent variables:  23 
## Mtry:                             4 
## Target node size:                 1 
## Variable importance mode:         impurity 
## Splitrule:                        gini 
## OOB prediction error:             3.97 %
test_15 <- test_15 %>% drop_na(year.y, temp, humid, wind_dir, wind_speed, pressure)
pred_15 <- predict(rf_15, test_15) #prediction

#train 15 accuracy
confusion_15_train <- table(train_15$dep_delay, rf_15$predictions)
confusion_15_train
##    
##         1     0
##   1 83047  2868
##   0  3980 82671
confusionMatrix(confusion_15_train)
## Confusion Matrix and Statistics
## 
##    
##         1     0
##   1 83047  2868
##   0  3980 82671
##                                           
##                Accuracy : 0.9603          
##                  95% CI : (0.9594, 0.9612)
##     No Information Rate : 0.5043          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9206          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9543          
##             Specificity : 0.9665          
##          Pos Pred Value : 0.9666          
##          Neg Pred Value : 0.9541          
##              Prevalence : 0.5043          
##          Detection Rate : 0.4812          
##    Detection Prevalence : 0.4979          
##       Balanced Accuracy : 0.9604          
##                                           
##        'Positive' Class : 1               
## 
#test 15 accuracy
confusion_15 <- table(test_15$dep_delay, pred_15$predictions)
confusion_15
##    
##         1     0
##   1 85911     4
##   0    15 86636
confusionMatrix(confusion_15)
## Confusion Matrix and Statistics
## 
##    
##         1     0
##   1 85911     4
##   0    15 86636
##                                           
##                Accuracy : 0.9999          
##                  95% CI : (0.9998, 0.9999)
##     No Information Rate : 0.5021          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.9998          
##                                           
##  Mcnemar's Test P-Value : 0.02178         
##                                           
##             Sensitivity : 0.9998          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.9998          
##              Prevalence : 0.4979          
##          Detection Rate : 0.4978          
##    Detection Prevalence : 0.4979          
##       Balanced Accuracy : 0.9999          
##                                           
##        'Positive' Class : 1               
## 

Threshold = 25

rf_25 <- run_random(train_25) #run random forest
## Growing trees.. Progress: 20%. Estimated remaining time: 2 minutes, 4 seconds.
## Growing trees.. Progress: 41%. Estimated remaining time: 1 minute, 29 seconds.
## Growing trees.. Progress: 62%. Estimated remaining time: 56 seconds.
## Growing trees.. Progress: 81%. Estimated remaining time: 28 seconds.
## Growing trees.. Progress: 99%. Estimated remaining time: 2 seconds.
rf_25
## Ranger result
## 
## Call:
##  ranger(dep_delay ~ ., data = train, num.trees = ntrees, importance = "impurity",      max.depth = depth, verbose = TRUE, seed = 26) 
## 
## Type:                             Classification 
## Number of trees:                  1000 
## Sample size:                      172566 
## Number of independent variables:  23 
## Mtry:                             4 
## Target node size:                 1 
## Variable importance mode:         impurity 
## Splitrule:                        gini 
## OOB prediction error:             3.97 %
test_25 <- test_25 %>% drop_na(year.y, temp, humid, wind_dir, wind_speed, pressure)
pred_25 <- predict(rf_25, test_25) #prediction

#train 25 accuracy
confusion_25_train <- table(train_25$dep_delay, rf_25$predictions)
confusion_25_train
##    
##         1     0
##   1 83047  2868
##   0  3980 82671
confusionMatrix(confusion_25_train)
## Confusion Matrix and Statistics
## 
##    
##         1     0
##   1 83047  2868
##   0  3980 82671
##                                           
##                Accuracy : 0.9603          
##                  95% CI : (0.9594, 0.9612)
##     No Information Rate : 0.5043          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9206          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9543          
##             Specificity : 0.9665          
##          Pos Pred Value : 0.9666          
##          Neg Pred Value : 0.9541          
##              Prevalence : 0.5043          
##          Detection Rate : 0.4812          
##    Detection Prevalence : 0.4979          
##       Balanced Accuracy : 0.9604          
##                                           
##        'Positive' Class : 1               
## 
#test 25 accuracy
confusion_25 <- table(test_25$dep_delay, pred_25$predictions)
confusion_25
##    
##         1     0
##   1 85911     4
##   0    16 86635
confusionMatrix(confusion_25)
## Confusion Matrix and Statistics
## 
##    
##         1     0
##   1 85911     4
##   0    16 86635
##                                           
##                Accuracy : 0.9999          
##                  95% CI : (0.9998, 0.9999)
##     No Information Rate : 0.5021          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.9998          
##                                           
##  Mcnemar's Test P-Value : 0.01391         
##                                           
##             Sensitivity : 0.9998          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.9998          
##              Prevalence : 0.4979          
##          Detection Rate : 0.4978          
##    Detection Prevalence : 0.4979          
##       Balanced Accuracy : 0.9999          
##                                           
##        'Positive' Class : 1               
## 

A visualization of the confusion matrices:

Lastly, we produced the importance of each predictor:

get_importance <- function(rf) {
  imps <- data.frame(
    var = names(train)[-5],
    imps = rf$variable.importance / max(rf$variable.importance)
  )
  return(imps)
}
imps <- get_importance(rf)
imps_15 <- get_importance(rf_15)
imps_25 <- get_importance(rf_25)

Plot of predictors’ importance:

Check for overfit with twice the number of trees and a limited depth for the trees:

rf_check <- run_random(train, ntrees = 2000, depth = 10)
## Growing trees.. Progress: 35%. Estimated remaining time: 56 seconds.
## Growing trees.. Progress: 71%. Estimated remaining time: 24 seconds.
#train accuracy
confusion_rf_train_check <- table(train$dep_delay, rf_check$predictions)
confusion_rf_train_check
##    
##         0     1
##   0 59934 25981
##   1 25569 61082
confusionMatrix(confusion_rf_train_check)
## Confusion Matrix and Statistics
## 
##    
##         0     1
##   0 59934 25981
##   1 25569 61082
##                                           
##                Accuracy : 0.7013          
##                  95% CI : (0.6991, 0.7034)
##     No Information Rate : 0.5045          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.4025          
##                                           
##  Mcnemar's Test P-Value : 0.07026         
##                                           
##             Sensitivity : 0.7010          
##             Specificity : 0.7016          
##          Pos Pred Value : 0.6976          
##          Neg Pred Value : 0.7049          
##              Prevalence : 0.4955          
##          Detection Rate : 0.3473          
##    Detection Prevalence : 0.4979          
##       Balanced Accuracy : 0.7013          
##                                           
##        'Positive' Class : 0               
## 
#test accuracy
pred_check <- predict(rf_check, test)
confusion_check <- table(test$dep_delay, pred_check$predictions)
confusionMatrix(confusion_check)
## Confusion Matrix and Statistics
## 
##    
##         0     1
##   0 24099 10793
##   1  2766  5448
##                                          
##                Accuracy : 0.6854         
##                  95% CI : (0.681, 0.6898)
##     No Information Rate : 0.6232         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.2577         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.8970         
##             Specificity : 0.3354         
##          Pos Pred Value : 0.6907         
##          Neg Pred Value : 0.6633         
##              Prevalence : 0.6232         
##          Detection Rate : 0.5591         
##    Detection Prevalence : 0.8094         
##       Balanced Accuracy : 0.6162         
##                                          
##        'Positive' Class : 0              
## 
# plot confusion matrix
plot(
  confusion_check,
  xlab = 'Predicted',
  ylab = 'True',
  main = sprintf(
    'Predicted vs Obsvered
 OOB prediction error: %s',
    round(rf_check$prediction.error, 3)
  )
)

imps_check <- get_importance(rf_check)
imps_check %>%
  ggplot(aes(imps, x = reorder(var, imps))) +
  geom_point(size = 3, colour = "#ff6767") +
  coord_flip() +
  labs(x = "Predictors", y = "Importance scores") +
  ggtitle('Importance') +
  theme_bw(18)

Discussion

We pre-processed the data to optimize each parameter’s potential to be helpful in prediction. We used permutation tests to determine which of the levels in each categorical variable with many levels are significant in predicting delayed flights. Part of the pre-processing included a Pearson correlation test to find numeric variables that are highly correlated to reduce the number of variables to include in our models. We converted the variable dep_delay from a delay in minutes to a binary variable. 0- no delay, 1- there is a delay. The division was made based on a threshold of 20 minutes late. In order to test the predictions’ sensitivity to this threshold, we created two additional datasets, one with a higher threshold (25 minutes) and one with a lower threshold (15 minutes), and trained the predictive models on these two datasets. Unfortunately, in both models, Decision Tree and Random Forest, we observed differences between the test accuracy of the primary dataset compared to the test accuracy of the datasets from the sensitivity checks. This means that the delay threshold value we set, dramatically affects the results. Note: both sensitivity checks resulted in the same results and were more accurate in prediction than the original dataset, raising our suspicion that these results are somehow skewed; this demands further examination.

At first, we trained a Classification Decision Tree model that reached a test accuracy score of approximately 0.7 in delay prediction. In order to build an optimal Classification Decision Tree model, we performed a tuning parameters process. This process yielded the values for the model that will result in the most accurate prediction. One of the most affecting hyper-parameters was the Complexity Parameter (CP); the slightest changes in the CP parameter resulted in dramatic differences in the test and train accuracy scores. We finally chose a CP value that resulted in a tree that is not overfitting the training data on the one hand, and on the other hand, maintains a relatively high test accuracy. We also examined whether a Random Forest model could learn the data better and predict more accurately. We assumed this model would perform better since Random Forests are considerably more robust than a single decision tree. They aggregate many decision trees to limit overfitting and error due to bias, and yield valuable results. The Random Forest model reached a test accuracy of 0.78. The training accuracy was higher, 0.94. This is an improvement from the Decision Tree model. In principle, the chance of an overfit in Random Forest is low thanks to bagging and random feature selection, it is trained independently on different subsets of the training data. However, the relatively big difference between the train and test accuracy scores raises our concern that the model may overfit the data. In order to examine the matter, we performed another run of the Random Forest model where we limited the maximum tree depth and increased the number of trees (from 1000 to 2000). The result of this run resulted in similar train and test accuracy scores (~0.7, and ~0.69). This means we eliminated the possibilty of overfitting, but recieved a result that is very close to the one produced bt the single Decision Tree model.

In conclusion, we have shown a proof of concept that it is possible to predict if there will be a delay in takeoff with a high probability using a Random Forest model that considers various parameters related to weather, aircraft characteristics, airport, destination, airlines, etc.

Future Research

For further research, it is worthwhile to implement the model on current databases for two reasons; The first is to verify the model and see that it is not adjusted only for 2013. Second, since the Corona epidemic, the world of flights has changed. The pandemic inflicted a heavy toll on global aviation, which resulted in ratings downgrades, liquidation, and bankruptcy of several airlines and airports due to severe cash burn instigated by travel restrictions. (Dube et al. 2021) In addition, our work divided the data into a binary form and used classification models to predict delayed or non-delayed flights. The downside of this approach is that the threshold set to determine if a flight was delayed affects the results. An improvement of the model can also include using a random forest that calculates regression and makes it possible to get an exact number of minutes of delay.

References

*Carvalho, Leonardo, Alice Sternberg, Leandro Maia Gonçalves, Ana Beatriz Cruz, Jorge A. Soares, Diego Brandão, Diego Carvalho, and Eduardo Ogasawara. 2021. “On the Relevance of Data Science for Flight Delay Research: A Systematic Review.” Transport Reviews 41 (4): 499–528.

*Elsayed, A. M. M. Decisions Tree Building for Different Types Data.2019

*Rebollo, Juan Jose, and Hamsa Balakrishnan. 2014. “Characterization and Prediction of Air Traffic Delays.” Transportation Research Part C: Emerging Technologies 44 (July): 231–41.

*Sternberg, Alice, Jorge Soares, Diego Carvalho, and Eduardo Ogasawara. 2017. “A Review on Flight Delay Prediction.” arXiv [cs.CY]. arXiv. http://arxiv.org/abs/1703.06118.

*Visa, Ramsay, Ralescu, and Van Der Knaap. n.d. “Confusion Matrix-Based Feature Selection.” MAICS. https://www.researchgate.net/profile/Jennifer-Seitzer-2/publication/220833258_Using_a_Genetic_Algorithm_to_Evolve_a_D_Search_Heuristic/links/545a2bea0cf2bccc49132577/Using-a-Genetic-Algorithm-to-Evolve-a-D-Search-Heuristic.pdf#page=126.

*Dube, Kaitano, Godwell Nhamo, and David Chikodzi. 2021. “COVID-19 Pandemic and Prospects for Recovery of the Global Aviation Industry.” Journal of Air Transport Management 92 (May): 102022.